From: Ian Varley
To: "user@hbase.apache.org"
Date: Fri, 25 May 2012 11:23:41 -0700
Subject: Re: Of hbase key distribution and query scalability, again.

Yeah, I think you're right, Dmitriy; there's nothing like that in HBase today as far as I know. If it'd be useful for you, maybe it would be for others, too; work up a rough patch and see what people think on the dev list.

Ian

On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:

> Thanks, Ian.
>
> I am talking about the situation where, even when we have uniform keys,
> the query distribution over them is still non-uniform and impossible to
> predict without sampling, and the query skew can be surprisingly large
> (the least and most active users may differ in activity by a factor of
> 100, and there is no way to know in advance which users are going to be
> active and which are not). Assuming there are a few very active users
> but many low-activity users, if two active users land in the same
> region, that creates a hotspot which could have been avoided if the
> region balancer took into account the number of hits each region has
> been getting recently.
>
> Like I pointed out before, such a skew-aware balancer could fairly
> easily be implemented externally to HBase (as in TotalOrderPartitioner),
> except that it would interfere with HBase's own balancer, so it would
> have to be integrated with the balancer in that case.
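>
> To make the idea concrete, here is roughly the kind of sampling I mean.
> This is a sketch only -- nothing like it exists in HBase today, and the
> class and method names are all made up:
>
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical: keeps an exponentially time-weighted hit count per
> // region, so a balancer could move or split the hottest regions.
> public class RegionHitSampler {
>   private final Map<String, Double> decayedHits =
>       new HashMap<String, Double>();
>   private final double alpha; // weight of newest sample, 0 < alpha <= 1
>
>   public RegionHitSampler(double alpha) { this.alpha = alpha; }
>
>   // Fold one sampling interval's raw hit count into the running,
>   // time-weighted average for that region.
>   public void sample(String regionName, long hitsThisInterval) {
>     Double prev = decayedHits.get(regionName);
>     double old = (prev == null) ? 0.0 : prev.doubleValue();
>     decayedHits.put(regionName,
>         alpha * hitsThisInterval + (1 - alpha) * old);
>   }
>
>   // Region a load-aware balancer would consider moving/splitting first.
>   public String hottestRegion() {
>     String hottest = null;
>     double max = -1;
>     for (Map.Entry<String, Double> e : decayedHits.entrySet()) {
>       if (e.getValue() > max) {
>         max = e.getValue();
>         hottest = e.getKey();
>       }
>     }
>     return hottest;
>   }
> }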
>
> Another distinct problem is the time parameters of such a balance
> controller. The load may change quickly or slowly enough that the
> sampling must itself be time-weighted.
>
> All these technicalities make it difficult to implement this outside
> HBase, or to solve it with key manipulation (the dynamic nature makes
> it hard to re-assign keys to match a newly discovered load
> distribution).
>
> OK, I guess there's nothing in HBase like that right now; otherwise I
> would've seen it in the book, I suppose...
>
> Thanks.
> -d
>
> On Fri, May 25, 2012 at 10:42 AM, Ian Varley wrote:
>> Dmitriy,
>>
>> If I understand you right, what you're asking about might be called
>> "Read Hotspotting". For an obvious example, if I distribute my data
>> nicely over the cluster but then say:
>>
>> // long, not int: the loop bound overflows a Java int
>> for (long x = 0; x < 10000000000L; x++) {
>>   htable.get(new Get(Bytes.toBytes("row1")));
>> }
>>
>> then naturally I'm only putting read load on the region server that
>> hosts "row1". That's contrived, of course; you'd never really do that.
>> But I can imagine plenty of situations where there's an imbalance in
>> query load w/r/t the leading part of the row key of a table. It's not
>> fundamentally different from "write hotspotting", except that it's
>> probably less common (it happens frequently in writes because
>> ascending data in a time series or number sequence is a common thing
>> to insert into a database).
>>
>> I guess the simple answer is: if you know of a non-even distribution
>> of read patterns, it might be something to consider in a custom
>> partitioning of the data into regions (rough sketch at the bottom of
>> this message, below your original note). I don't know of any other
>> technique (short of some external caching mechanism) that'd alleviate
>> this; at base, you still have to ask exactly one RS for any given
>> piece of data.
>>
>> Ian
>>
>> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>>
>>> Hello,
>>>
>>> I'd like to collect opinions from HBase experts on query uniformity,
>>> and on whether any advanced techniques currently exist in HBase to
>>> cope with non-uniform queries beyond just keeping the keys uniformly
>>> distributed.
>>>
>>> I know we start with the statement that in order to scale queries, we
>>> need them uniformly distributed over the key space. The next advice
>>> people get is to use uniformly distributed keys. Then, the thinking
>>> goes, the query load will also be uniformly distributed among
>>> regions.
>>>
>>> For what seems an embarrassingly long time, however, I was missing
>>> the point that using uniformly distributed keys does not equate to a
>>> uniform distribution of queries, since it doesn't account for the
>>> skew of the queries over the key space itself. This skew can be bad
>>> enough under some circumstances to create query hot spots in the
>>> cluster which could have been avoided had region splits been balanced
>>> on query load rather than on data size per se (a sort of dynamic
>>> query-distribution sampling to equalize the load, similar to how
>>> TotalOrderPartitioner does random data sampling to estimate the key
>>> skew in the incoming data).
>>>
>>> To cut a long story short: is region size the only technique HBase
>>> currently uses to balance load, especially w.r.t. query load? Or are
>>> there more advanced techniques for that?
>>>
>>> Thank you very much.
>>> -Dmitriy
>>
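>> P.S. To make "custom partitioning" concrete: you can pre-split the
>> table at points you pick from your known (or sampled) read
>> distribution, instead of letting HBase split purely on size. Rough,
>> untested sketch -- the table name, column family, and split points
>> here are made up; you'd derive real split points from your own sampled
>> query logs:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.HColumnDescriptor;
>> import org.apache.hadoop.hbase.HTableDescriptor;
>> import org.apache.hadoop.hbase.client.HBaseAdmin;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> Configuration conf = HBaseConfiguration.create();
>> HBaseAdmin admin = new HBaseAdmin(conf);
>> HTableDescriptor desc = new HTableDescriptor("mytable");
>> desc.addFamily(new HColumnDescriptor("d"));
>> // One region per bucket of roughly equal *query* load (not equal key
>> // count) -- heavily-read key ranges get more, narrower regions:
>> byte[][] splits = new byte[][] {
>>     Bytes.toBytes("g"), Bytes.toBytes("m"), Bytes.toBytes("t")
>> };
>> admin.createTable(desc, splits);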