hbase-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Of hbase key distribution and query scalability, again.
Date Fri, 25 May 2012 18:02:50 GMT
Thanks, Ian.

I am talking about the situation where, even with uniform keys, the
query distribution over them is still non-uniform and impossible to
predict without sampling the query skew, and the skew is surprisingly
large (the least active and the most active user may differ in activity
by a factor of 100, and there is no way to know in advance which users
are going to be active and which are not). Assuming there are a few
very active users and many low-activity users, if two active users land
in the same region, that creates a hotspot which could have been
avoided if the region balancer took into account the number of hits
each region has been getting recently.
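
To make the idea concrete, here is a minimal sketch of hit-aware
placement, assuming per-region hit counts are already available. It is
not HBase code: the class HitAwarePlacement and its assign method are
hypothetical names, and it deliberately ignores everything the real
balancer cares about (region counts, data size, locality). It places
the hottest regions first, each on the currently least-loaded server.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of HBase or its balancer.
final class HitAwarePlacement {

    /**
     * Greedy placement: visit regions from hottest to coldest and put each
     * one on the server with the smallest accumulated hit count so far.
     * Returns the chosen server index for every region.
     */
    static int[] assign(double[] regionHits, int servers) {
        if (servers <= 0) {
            throw new IllegalArgumentException("servers must be positive");
        }
        Integer[] order = new Integer[regionHits.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort region indexes by descending hit count.
        Arrays.sort(order, (a, b) -> Double.compare(regionHits[b], regionHits[a]));

        double[] serverLoad = new double[servers];
        int[] placement = new int[regionHits.length];
        for (int region : order) {
            int best = 0;
            for (int s = 1; s < servers; s++) {
                if (serverLoad[s] < serverLoad[best]) {
                    best = s;
                }
            }
            placement[region] = best;
            serverLoad[best] += regionHits[region];
        }
        return placement;
    }
}
```

With two hot regions and two cold ones on two servers, this puts the
two hot regions on different servers, which a balancer that only
equalizes region counts would not guarantee.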

As I pointed out before, such a skew-aware balancer could be fairly
easily implemented outside HBase (in the manner of
TotalOrderPartitioner), except that it would interfere with HBase's own
balancer, so in that case it must be integrated with the balancer.

Another distinct problem is the time parameters of such a balance
controller. The load may change quickly or slowly, so the sampling
itself must be time-weighted.
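
One possible way to do time-weighted sampling is an exponentially
decayed hit counter per region, sketched below. This is an assumption
about how it could be done, not something HBase provides: the class
DecayedHitCounter and the half-life parameter are hypothetical. A short
half-life tracks fast-changing load, a long one smooths out bursts.

```java
// Hypothetical illustration only; not part of HBase.
final class DecayedHitCounter {
    private final double halfLifeMillis;
    private double value;          // decayed hit count as of lastUpdateMillis
    private long lastUpdateMillis;

    DecayedHitCounter(double halfLifeMillis) {
        if (halfLifeMillis <= 0) {
            throw new IllegalArgumentException("half-life must be positive");
        }
        this.halfLifeMillis = halfLifeMillis;
    }

    // Halve the stored count for every half-life elapsed since the last update.
    private void decayTo(long nowMillis) {
        long elapsed = nowMillis - lastUpdateMillis;
        if (elapsed > 0) {
            value *= Math.pow(0.5, elapsed / halfLifeMillis);
            lastUpdateMillis = nowMillis;
        }
    }

    /** Record one hit (for example, one Get against the region) at the given time. */
    void hit(long nowMillis) {
        decayTo(nowMillis);
        value += 1.0;
    }

    /** Current time-weighted load; older hits count for less. */
    double load(long nowMillis) {
        decayTo(nowMillis);
        return value;
    }
}
```

Each hit counts as 1 when it happens and as 0.5 one half-life later, so
the counter reflects recent activity rather than the all-time total.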

All these technicalities make it difficult to implement this outside
HBase or to use key manipulation (the dynamic nature of the load makes
it difficult to re-assign keys to match newly discovered load).
OK, I guess there's nothing like that in HBase right now, otherwise I
would have seen it in the book, I suppose...


On Fri, May 25, 2012 at 10:42 AM, Ian Varley <ivarley@salesforce.com> wrote:
> Dmitriy,
> If I understand you right, what you're asking about might be called "Read Hotspotting".
> For an obvious example, if I distribute my data nicely over the cluster but then say:
> // long counter: 10000000000 does not fit in an int
> for (long x = 0; x < 10000000000L; x++) {
>   htable.get(new Get(Bytes.toBytes("row1")));
> }
> Then naturally I'm only putting read load on the region server that hosts "row1". That's
> contrived, of course, you'd never really do that. But I can imagine plenty of situations where
> there's an imbalance in query load w/r/t the leading part of the row key of a table. It's
> not fundamentally different from "write hotspotting", except that it's probably less common
> (it happens frequently in writes because ascending data in a time series or number sequence
> is a common thing to insert into a database).
> I guess the simple answer is, if you know of non-even distribution of read patterns,
> it might be something to consider in a custom partitioning of the data into regions. I don't
> know of any other technique (short of some external caching mechanism) that'd alleviate this;
> at base, you still have to ask exactly one RS for any given piece of data.
> Ian
> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>> Hello,
>> I'd like to collect opinions from HBase experts on query
>> uniformity and on whether any advanced technique currently exists
>> in HBase to cope with the problems of query uniformity beyond just
>> maintaining a uniform key distribution.
>> I know we start with the statement that in order to scale queries, we
>> need them uniformly distributed over key space. The next advice people
>> get is to use a uniformly distributed key. Then, the thinking goes, the
>> query load will also be uniformly distributed among regions.
>> For what seems to be an embarrassingly long time I was missing the
>> point, however, that using uniformly distributed keys does not equate
>> to uniform distribution of the queries, since it doesn't account for
>> skewness of queries over the key space itself. This skewness can be
>> bad enough under some circumstances to create query hot spots in the
>> cluster which could have been avoided had region splits been
>> balanced based on query load rather than on data size per se (a sort
>> of dynamic query distribution sampling in order to equalize the load,
>> similar to how TotalOrderPartitioner does random data sampling to
>> build the distribution of key skewness in the incoming data).
>> To cut a long story short, is region size the only current HBase
>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>> some more advanced techniques to do that?
>> Thank you very much.
>> -Dmitriy
