hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: RowKey hashing in HBase 1.0
Date Thu, 07 May 2015 02:03:18 GMT
Jeremy, 

I think you have to be careful in how you say things. 
While you’re going to get an even distribution over time, the hash isn’t random. It’s
consistent, so hash(x) = y will always be the same. 
You’re taking the modulus to create n buckets, numbered 0 to n-1. 

In each bucket, your new key is X_rowkey, where X is the bucket number and rowkey is the original row key. 
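
A minimal sketch of the mod(hash(rowkey), n) prefix, assuming MD5 truncated to its
first 4 bytes and a zero-padded bucket number. The name bucketed_key, the 4-byte
truncation, and the padding are illustrative assumptions, not something HBase provides.

```python
import hashlib


def bucketed_key(rowkey: str, n_buckets: int) -> str:
    """Prefix rowkey with mod(hash(rowkey), n_buckets).

    Hypothetical helper for illustration only.
    """
    digest = hashlib.md5(rowkey.encode("utf-8")).digest()
    # Truncate the hash to its first 4 bytes, then take the modulus.
    bucket = int.from_bytes(digest[:4], "big") % n_buckets
    # Zero-pad so that prefixes sort correctly as strings (assumption).
    width = len(str(n_buckets - 1))
    return f"{bucket:0{width}d}_{rowkey}"


# The same row key always maps to the same bucket.
print(bucketed_key("1234foobar", 2000))
print(bucketed_key("1234foobar", 2000))
```

Because the prefix is computed from the row key itself, a reader can recompute it
and issue a single get() instead of probing every bucket.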

Remember that the rowkey is growing sequentially.  rowkey(n) < rowkey(n+1) …  < rowkey(n+k)


So if you hash, take the modulus, and prepend it, you will still have X_rowkey(n), X_rowkey(n+k), …
in sequential order within bucket X. 


All you have is n sequential lists. And again, with a sequential list you’re adding to the
right, so when you split, the top section is never going to get new rows. 

I think you need to create a list and try this with 3 or 4 buckets, and you’ll start to
see what happens. 

The last region fills, but after it splits, the top half is static. The new rows are added
to the bottom half only. 
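
The experiment with 3 or 4 buckets can be sketched roughly as follows. This is a toy
model, not HBase itself: it assumes an MD5-based bucket function, regions held as
sorted Python lists, and a midpoint split once a region reaches split_at rows. The
names bucket_of, simulate, and split_at are made up for the illustration.

```python
import bisect
import hashlib


def bucket_of(rowkey: int, n_buckets: int) -> int:
    """mod(hash(rowkey), n_buckets), using a truncated MD5 (assumption)."""
    digest = hashlib.md5(str(rowkey).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets


def simulate(n_rows: int, n_buckets: int, split_at: int):
    """Insert sequential row keys 0..n_rows-1 into a toy region map."""
    # Start with one region per bucket; starts[i] is the start key of rows[i].
    starts = [(b, -1) for b in range(n_buckets)]
    rows = [[] for _ in range(n_buckets)]
    for rowkey in range(n_rows):
        key = (bucket_of(rowkey, n_buckets), rowkey)
        # Find the region whose start key is the largest one <= key.
        i = bisect.bisect_right(starts, key) - 1
        bisect.insort(rows[i], key)
        if len(rows[i]) >= split_at:
            # Split at the midpoint: lower half stays, upper half is a new region.
            mid = len(rows[i]) // 2
            upper = rows[i][mid:]
            rows[i] = rows[i][:mid]
            starts.insert(i + 1, upper[0])
            rows.insert(i + 1, upper)
    return starts, rows


starts, rows = simulate(n_rows=3000, n_buckets=3, split_at=100)
for b in range(3):
    sizes = [len(r) for s, r in zip(starts, rows) if s[0] == b]
    print(f"bucket {b}: region sizes {sizes}")
```

In this model, every region except the last one in each bucket stays at split_at / 2
rows after its split, because each new key sorts after every existing key in its bucket.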

This is a problem with sequential keys that you have to learn to live with. 

It’s not a killer issue, but something you need to be aware of. 

> On May 6, 2015, at 4:00 PM, jeremy p <athomewithagroovebox@gmail.com> wrote:
> 
> Thank you for the explanation, but I'm a little confused.  The key will be
> monotonically increasing, but the hash of that key will not be.
> 
> So, even though your original keys may look like : 1_foobar, 2_foobar,
> 3_foobar
> After the hashing, they'd look more like : 349000_1_foobar,
> 999999_2_foobar, 000001_3_foobar
> 
> With five regions, the original key ranges for your regions would look
> something like : 000000-199999, 200000-399999, 400000-599999,
> 600000-799999, 800000-999999
> 
> So let's say you add another row.  It causes a split.  Now your regions
> look like :  000000-199999, 200000-399999, 400000-599999, 600000-799999,
> 800000-899999, 900000-999999
> 
> Since the value that you are prepending to your keys is essentially random,
> I don't see why your regions would only fill halfway.  A new, hashed key
> would be just as likely to fall within 800000-899999 as it would be to fall
> within 900000-999999.
> 
> Are we working from different assumptions?
> 
> On Tue, May 5, 2015 at 4:46 PM, Michael Segel <michael_segel@hotmail.com>
> wrote:
> 
>> Yes, what you described, mod(hash(rowkey),n) where n is the number of
>> regions, will remove the hotspotting issue.
>> 
>> However, if your key is sequential you will only have regions half full
>> post region split.
>> 
>> Look at it this way…
>> 
>> If I have a key that is a sequential count 1,2,3,4,5 … I am always adding
>> a new row to the last region and it’s always being added to the right
>> (reading left to right). Always at the end of the line…
>> 
>> So if I have 10,000 rows and I split the region… region 1 has 0 to 4,999
>> and region 2 has 5000 to 10000.
>> 
>> Now my next row is 10001, the following is 10002 … so they will be added
>> at the tail end of region 2 until it splits.  (And so on, and so on…)
>> 
>> If you take a modulus of the hash, you create n buckets. Again for each
>> bucket… I will still be adding a new larger number so it will be added to
>> the right hand side or tail of the list.
>> 
>> Once a region is split… that’s it.
>> 
>> Bucketing will solve the hot spotting issue by creating n lists of rows,
>> but you’re still always adding to the end of the list.
>> 
>> Does that make sense?
>> 
>> 
>>> On May 5, 2015, at 10:04 AM, jeremy p <athomewithagroovebox@gmail.com> wrote:
>>> 
>>> Thank you for your response!
>>> 
>>> So I guess 'salt' is a bit of a misnomer.  What I used to do is this :
>>> 
>>> 1) Say that my key value is something like '1234foobar'
>>> 2) I obtain the hash of '1234foobar'.  Let's say that's '54824923'
>>> 3) I mod the hash by my number of regions.  Let's say I have 2000 regions.
>>> 54824923 % 2000 = 923
>>> 4) I prepend that value to my original key value, so my new key is
>>> '923_1234foobar'
>>> 
>>> Is this the same thing you were talking about?
>>> 
>>> A couple questions :
>>> 
>>> * Why would my regions only be 1/2 full?
>>> * Why would I only use this for sequential keys?  I would think this would
>>> give better performance in any situation where I don't need range scans.
>>> For example, let's say my key value is a person's last name.  That will
>>> naturally cluster around certain letters, giving me an uneven distribution.
>>> 
>>> --Jeremy
>>> 
>>> 
>>> 
>>> On Sun, May 3, 2015 at 11:46 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>> 
>>>> Yes, don’t use a salt. Salt implies that your seed is orthogonal (read
>>>> random) to the base table row key.
>>>> You’re better off using a truncated hash (md5 is fastest) so that at least
>>>> you can use a single get().
>>>> 
>>>> Common?
>>>> 
>>>> Only if your row key is mostly sequential.
>>>> 
>>>> Note that even with bucketing, you will still end up with regions only 1/2
>>>> full with the only exception being the last region.
>>>> 
>>>>> On May 1, 2015, at 11:09 AM, jeremy p <athomewithagroovebox@gmail.com> wrote:
>>>>> 
>>>>> Hello all,
>>>>> 
>>>>> I've been out of the HBase world for a while, and I'm just now jumping back
>>>>> in.
>>>>> 
>>>>> As of HBase .94, it was still common to take a hash of your RowKey and use
>>>>> that to "salt" the beginning of your RowKey to obtain an even distribution
>>>>> among your region servers.  Is this still a common practice, or is there a
>>>>> better way to do this in HBase 1.0?
>>>>> 
>>>>> --Jeremy
>>>> 
>> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is
purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com





