hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Prefix salting pattern
Date Mon, 19 May 2014 07:24:10 GMT
> This is even better when you don't necessarily care about the
> order of every row, but want every row in a given range (then you can
> just get whatever row is available from a buffer in the client).

You do realize that in the general case you want to return the result set in sort order. 
So you will have to put the resulting range scans in sort order. 

If you’re saying that you don’t care about the order of the row sets… then why are you
using a sequential row key which causes hot spotting in the first place? 
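
To make the sort-order point concrete, here is a minimal sketch of the client-side merge it implies. It assumes each of the n range scans returns (prefixed_key, value) pairs already sorted within its bucket, and that the prefix is a fixed number of leading bytes. The function names and the in-memory lists are hypothetical stand-ins for real scanners, not HBase API calls.

```python
import heapq


def strip_prefix(rows, prefix_len):
    """Yield (original_key, value), dropping the bucket prefix from each key."""
    for prefixed_key, value in rows:
        yield prefixed_key[prefix_len:], value


def merge_bucket_scans(bucket_results, prefix_len=1):
    """Lazily merge per-bucket results into one stream ordered by original key.

    Each element of bucket_results must already be sorted by key, which holds
    for a range scan over a single prefix.
    """
    streams = [strip_prefix(rows, prefix_len) for rows in bucket_results]
    # heapq.merge performs a k-way merge without re-sorting the whole set.
    return heapq.merge(*streams, key=lambda kv: kv[0])


# Two hypothetical buckets, each sorted within itself.
bucket_0 = [(b"0k01", "a"), (b"0k04", "d")]
bucket_1 = [(b"1k02", "b"), (b"1k03", "c")]
merged = list(merge_bucket_scans([bucket_0, bucket_1]))
```

Because every input is already sorted, the merge only keeps one pending row per bucket, but the client still has to hold n scanners open at once.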


On May 18, 2014, at 9:19 PM, Mike Axiak <mike@axiak.net> wrote:

> In our measurements, scanning is improved by performing against n
> range scans rather than 1 (since you are effectively striping the
> reads). This is even better when you don't necessarily care about the
> order of every row, but want every row in a given range (then you can
> just get whatever row is available from a buffer in the client).
> 
> -Mike
> 
> On Sun, May 18, 2014 at 1:07 PM, Michael Segel
> <michael_segel@hotmail.com> wrote:
>> No, you’re missing the point.
>> It’s not a good idea or design.
>> 
>> Is your data mutable or static?
>> 
>> To your point: every time you want to do a simple get() you have to open up n get()
>> statements. On your range scans you will have to do n range scans, then join and sort the
>> result sets. The fact that each result set is in sort order will help a little, but it’s still
>> not that clean.
>> 
>> 
>> 
>> On May 18, 2014, at 4:58 PM, Software Dev <static.void.dev@gmail.com> wrote:
>> 
>>> You may be missing the point. The primary reason for the salt prefix
>>> pattern is to avoid hotspotting when inserting time series data AND at
>>> the same time provide a way to perform range scans.
>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>> 
>>>> NOTE:  Many people worry about hot spotting when they really don’t have
>>>> to do so. Hot spotting that occurs on the initial load of a table is OK. It’s when you have
>>>> a sequential row key that you run into problems with hot spotting and regions being only
>>>> half filled.
>>> 
>>> The data being inserted will be a constant stream of time ordered data
>>> so yes, hotspotting will be an issue
>>> 
>>>> Adding a random value to give you a bit of randomness now means that you
>>>> can’t do a range scan.
>>> 
>>> That's not accurate. To perform a range scan you would just need to
>>> open up N scanners where N is the size of the buckets/random prefixes
>>> used.
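
A minimal sketch of the "N scanners" idea described above, assuming the bucket prefix is a single leading byte with values 0..N-1. The function name and the date-style keys are hypothetical, and no HBase client calls are shown; the sketch only computes the start and stop rows that each scanner would be given.

```python
def bucket_scan_ranges(start_key, stop_key, num_buckets):
    """Return one (start_row, stop_row) pair per bucket prefix.

    Assumes a one-byte prefix, so num_buckets must fit in a single byte.
    """
    if not 0 < num_buckets <= 256:
        raise ValueError("num_buckets must be between 1 and 256")
    return [
        (bytes([bucket]) + start_key, bytes([bucket]) + stop_key)
        for bucket in range(num_buckets)
    ]


# One logical range over a day of time-ordered keys becomes 4 physical ranges.
ranges = bucket_scan_ranges(b"2014-05-18", b"2014-05-19", 4)
```

Each pair would drive its own scanner, so the number of scanners per logical range scan equals the number of buckets.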
>>> 
>>>> Don’t take the modulo, just truncate to the first byte.  Taking the modulo
>>>> is again a dumb idea, but not as dumb as using a salt.
>>> 
>>> Well the only reason why I would think using a salt would be
>>> beneficial is to limit the number of scanners when performing a range
>>> scan. See above comment. And yes, performing a range scan will be our
>>> primary read pattern.
>>> 
>>> On Sun, May 18, 2014 at 2:36 AM, Michael Segel
>>> <michael_segel@hotmail.com> wrote:
>>>> I think I should dust off my schema design talk… clearly the talks given
>>>> by some of the vendors don’t really explain things …
>>>> (Hmmm. Strata London?)
>>>> 
>>>> See my reply below…. Note I used SHA-1. MD-5 should also give you roughly
>>>> the same results.
>>>> 
>>>> On May 18, 2014, at 4:28 AM, Software Dev <static.void.dev@gmail.com> wrote:
>>>> 
>>>>> I recently came across the pattern of adding a salting prefix to the
>>>>> row keys to prevent hotspotting. Still trying to wrap my head around
>>>>> it and I have a few questions.
>>>>> 
>>>> 
>>>> If you add a salt, you’re prepending a random number to a row in order
>>>> to avoid hot spotting.  It amazes me that Sematext never went back and either removed the
>>>> blog or fixed it and now the bad idea is getting propagated.  Adding a random value to give
>>>> you a bit of randomness now means that you can’t do a range scan, or fetch the specific
>>>> row with a single get(), so you’re going to end up boiling the ocean to get your data. You’re
>>>> better off using hive/spark/shark than hbase.
>>>> 
>>>> As James tries to point out, you take the hash of the row so that you can
>>>> easily retrieve the value. But rather than prepend a 160 bit hash, you can easily achieve
>>>> the same thing by just truncating the hash to the first byte in order to get enough randomness
>>>> to avoid hot spotting. Of course, the one question you should ask is why don’t you just
>>>> take the hash as the row key and then have a 160 bit row key (20 bytes raw, or 40 characters
>>>> as a hex string)? Then store the actual key as a column in the table.
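
The difference between a random salt and a hash-derived prefix can be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the function names are hypothetical, SHA-1 is used because the message mentions it, and the prefix is taken as the first raw digest byte (256 possible values; taking the first hex character instead would give 16).

```python
import hashlib


def prefixed_key(row_key: bytes) -> bytes:
    """Prepend the first byte of SHA-1(row_key) to the original key.

    The prefix is a pure function of the key, so a reader can recompute it
    and fetch the row with a single get().
    """
    digest = hashlib.sha1(row_key).digest()
    return digest[:1] + row_key


def full_hash_key(row_key: bytes) -> bytes:
    """Use the whole 160-bit digest as the row key (20 raw bytes).

    The original key would then be stored as a column value.
    """
    return hashlib.sha1(row_key).digest()


write_key = prefixed_key(b"row-0001")
read_key = prefixed_key(b"row-0001")  # recomputed at read time, same result
```

A random salt offers no such guarantee: the reader cannot know which prefix the writer chose, so a point lookup has to try every possible prefix.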
>>>> 
>>>> And then there’s a bigger question… why are you worried about hot spotting?
>>>> Are you adding rows where the row key is sequential?  Or are you worried about when you first
>>>> start loading rows, that you are hot spotting, but the underlying row key is random enough
>>>> that once the first set of rows are added, HBase splitting regions will be enough?
>>>> 
>>>>> - Is there ever a reason to salt to more buckets than there are region
>>>>> servers? The only reason why I think that may be beneficial is to
>>>>> anticipate future growth???
>>>>> 
>>>> Doesn’t matter.
>>>> Think about how HBase splits regions.
>>>> Don’t take the modulo, just truncate to the first byte.  Taking the modulo
>>>> is again a dumb idea, but not as dumb as using a salt.
>>>> 
>>>> Keep in mind that the first character of the hash is going to be 0-f in a hex
>>>> representation (4 bits of the 160-bit key), so you have 16 values to start with.
>>>> That should be enough.
>>>> 
>>>>> - Is it beneficial to always hash against a known number of buckets
>>>>> (ie never change the size) that way for any individual row key you can
>>>>> always determine the prefix?
>>>>> 
>>>> Your question doesn’t make sense.
>>>> 
>>>>> - Are there any good use cases of this pattern out in the wild?
>>>>> 
>>>> Yup.
>>>> Deduping data sets.
>>>> 
>>>>> Thanks
>>>>> 
>>>> NOTE:  Many people worry about hot spotting when they really don’t have
>>>> to do so. Hot spotting that occurs on the initial load of a table is OK. It’s when you have
>>>> a sequential row key that you run into problems with hot spotting and regions being only
>>>> half filled.
>>>> 
>>> 
>> 
> 

