hbase-user mailing list archives

From highpointe <highpoint...@gmail.com>
Subject Re: Of hbase key distribution and query scalability, again.
Date Sun, 27 May 2012 04:57:07 GMT
Here is my SS:  259 71 2451

On May 26, 2012, at 9:25 AM, Michael Segel <michael_segel@hotmail.com> wrote:

> Hi,
> 
> Jumping in on this late...
> 
>>>>> To cut a long story short, is the region size the only current HBase
>>>>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>>>>> some more advanced techniques to do that?
> 
> So maybe I'm missing something but I don't see the problem.
> 
> In terms of writing data to be evenly/randomly distributed, you would hash
> the key (MD5 or SHA-1, for example).
> This works well if you're doing get()s and not a lot of scan()s. 
> 
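> Something like the following, say (a rough sketch; the class name and the
> 4-byte prefix length are just illustrative):
> 
>     import java.security.MessageDigest;
>     import java.security.NoSuchAlgorithmException;
> 
>     public class HashedKeys {
>         // Prefix the natural key with a few bytes of its MD5 digest so
>         // writes spread evenly across regions. get() still works because
>         // the prefix is recomputable from the key; range scans over the
>         // natural key order are lost.
>         public static byte[] hashedRowKey(byte[] naturalKey) {
>             try {
>                 byte[] digest = MessageDigest.getInstance("MD5").digest(naturalKey);
>                 byte[] row = new byte[4 + naturalKey.length];
>                 System.arraycopy(digest, 0, row, 0, 4);
>                 System.arraycopy(naturalKey, 0, row, 4, naturalKey.length);
>                 return row;
>             } catch (NoSuchAlgorithmException e) {
>                 throw new RuntimeException(e);  // MD5 is always available
>             }
>         }
>     }
> 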
> But on reads, how do you get 'hot spotting'? 
> 
> Should those rows be cached in memory? 
> 
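> (One knob for that: mark the column family in-memory so its blocks are
> favored in the LRU block cache. A rough two-line sketch; the family name
> here is made up:
> 
>     HColumnDescriptor family = new HColumnDescriptor("f");
>     family.setInMemory(true);  // favor this family's blocks in the cache
> 
> That only helps if the hot rows actually fit in the cache, of course.)
> 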
> So what am I missing? Besides another cup of coffee?  
> 
> -Mike
> 
> On May 25, 2012, at 1:23 PM, Ian Varley wrote:
> 
>> Yeah, I think you're right Dmitriy; there's nothing like that in HBase
>> today as far as I know. If it'd be useful for you, maybe it would be for
>> others, too; work up a rough patch and see what people think on the dev
>> list.
>> 
>> Ian
>> 
>> On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:
>> 
>>> Thanks, Ian.
>>> 
>>> I am talking about the situation where, even when we have uniform keys,
>>> the query distribution over them is still non-uniform and impossible to
>>> predict without sampling query skewness, and the skewness can be
>>> surprisingly large (the least active and most active users may differ in
>>> activity by 100 times, and there is no way one could know which users are
>>> going to be active and which are not). Assuming there are a few very
>>> active users but many low-activity users, if two active users get into
>>> the same region, it creates a hotspot which could have been avoided if
>>> the region balancer took notice of the number of hits the regions have
>>> been getting recently.
>>> 
>>> Like I pointed out before, such a skewness balancer could be fairly
>>> easily implemented externally to HBase (as in TotalOrderPartitioner),
>>> except that it would interfere with HBase's own balancer, so it would
>>> have to be integrated with the balancer in that case.
>>> 
>>> Another distinct problem is the time parameters of such a balance
>>> controller. The load may change quickly or slowly enough that the
>>> sampling must itself be time-weighted.
>>> 
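>>> For instance, a minimal sketch of a time-weighted (exponentially decayed)
>>> request counter one could sample per region (the class name and decay
>>> factor are made up):
>>> 
>>>     // alpha near 1.0 forgets slowly (stable, slow to react); alpha near
>>>     // 0.0 forgets quickly (reactive, but noisy).
>>>     public class DecayingCounter {
>>>         private final double alpha;   // decay factor per tick, e.g. 0.95
>>>         private double rate;          // smoothed requests-per-interval
>>>         private long sinceLastTick;
>>> 
>>>         public DecayingCounter(double alpha) { this.alpha = alpha; }
>>> 
>>>         public synchronized void hit() { sinceLastTick++; }
>>> 
>>>         // Call once per sampling interval, e.g. from a timer thread.
>>>         public synchronized double tick() {
>>>             rate = alpha * rate + (1 - alpha) * sinceLastTick;
>>>             sinceLastTick = 0;
>>>             return rate;
>>>         }
>>>     }
>>> 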
>>> All these technicalities make it difficult to implement this outside
>>> HBase or via key manipulation (the dynamic nature makes it difficult to
>>> deal with re-assigning keys to match a newly discovered load
>>> distribution).
>>> 
>>> OK, I guess there's nothing in HBase like that right now; otherwise I
>>> would've seen it in the book, I suppose...
>>> 
>>> Thanks.
>>> -d
>>> 
>>> On Fri, May 25, 2012 at 10:42 AM, Ian Varley <ivarley@salesforce.com> wrote:
>>>> Dmitriy,
>>>> 
>>>> If I understand you right, what you're asking about might be called
>>>> "Read Hotspotting". For an obvious example, if I distribute my data
>>>> nicely over the cluster but then say:
>>>> 
>>>> for (long x = 0; x < 10000000000L; x++) {
>>>>     htable.get(new Get(Bytes.toBytes("row1")));
>>>> }
>>>> 
>>>> Then naturally I'm only putting read load on the region server that
>>>> hosts "row1". That's contrived, of course; you'd never really do that.
>>>> But I can imagine plenty of situations where there's an imbalance in
>>>> query load w/r/t the leading part of the row key of a table. It's not
>>>> fundamentally different from "write hotspotting", except that it's
>>>> probably less common (it happens frequently in writes because ascending
>>>> data in a time series or number sequence is a common thing to insert
>>>> into a database).
>>>> 
>>>> I guess the simple answer is, if you know of a non-even distribution of
>>>> read patterns, it might be something to consider in a custom partitioning
>>>> of the data into regions. I don't know of any other technique (short of
>>>> some external caching mechanism) that'd alleviate this; at base, you
>>>> still have to ask exactly one RS for any given piece of data.
>>>> 
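>>>> For example, here's a rough sketch of creating a table pre-split at
>>>> custom points, against the 0.92-era client API (the table, family, and
>>>> split keys are all made up; split keys must be in sorted order):
>>>> 
>>>>     import org.apache.hadoop.conf.Configuration;
>>>>     import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>     import org.apache.hadoop.hbase.HColumnDescriptor;
>>>>     import org.apache.hadoop.hbase.HTableDescriptor;
>>>>     import org.apache.hadoop.hbase.client.HBaseAdmin;
>>>>     import org.apache.hadoop.hbase.util.Bytes;
>>>> 
>>>>     public class PreSplit {
>>>>         public static void main(String[] args) throws Exception {
>>>>             Configuration conf = HBaseConfiguration.create();
>>>>             HBaseAdmin admin = new HBaseAdmin(conf);
>>>>             HTableDescriptor desc = new HTableDescriptor("mytable");
>>>>             desc.addFamily(new HColumnDescriptor("f"));
>>>>             // Narrow regions around the key ranges you expect to be
>>>>             // read hardest, wider regions elsewhere.
>>>>             byte[][] splits = new byte[][] {
>>>>                 Bytes.toBytes("hot-user-a"),
>>>>                 Bytes.toBytes("hot-user-b"),
>>>>                 Bytes.toBytes("m"),
>>>>             };
>>>>             admin.createTable(desc, splits);
>>>>         }
>>>>     }
>>>> 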
>>>> Ian
>>>> 
>>>> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> I'd like to collect opinions from HBase experts on query uniformity,
>>>>> and on whether any advanced technique currently exists in HBase to cope
>>>>> with the problems of query uniformity beyond just maintaining a uniform
>>>>> key distribution.
>>>>> 
>>>>> I know we start with the statement that in order to scale queries, we
>>>>> need them uniformly distributed over the key space. The next advice
>>>>> people get is to use uniformly distributed keys. Then, the thinking
>>>>> goes, the query load will also be uniformly distributed among regions.
>>>>> 
>>>>> For what seems an embarrassingly long time, however, I was missing the
>>>>> point that using uniformly distributed keys does not equate to a
>>>>> uniform distribution of the queries, since it doesn't account for
>>>>> skewness of queries over the key space itself. This skewness can be
>>>>> bad enough under some circumstances to create query hot spots in the
>>>>> cluster which could have been avoided had region splits been balanced
>>>>> based on query load rather than on data size per se (a sort of dynamic
>>>>> query-distribution sampling to equalize the load, similar to how
>>>>> TotalOrderPartitioner does random data sampling to build a distribution
>>>>> of the key skewness in the incoming data).
>>>>> 
>>>>> To cut a long story short, is the region size the only current HBase
>>>>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>>>>> some more advanced techniques to do that?
>>>>> 
>>>>> Thank you very much.
>>>>> -Dmitriy
>>>> 
>> 
>> 
> 
