hbase-user mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: Is it necessary to set MD5 on rowkey?
Date Wed, 19 Dec 2012 18:37:57 GMT
I would disagree here.
It depends on what you are doing, and blanket statements like "this is very, very bad" typically
do not help.

Salting (even round robin) is very nice to distribute write load *and* it gives you a natural
way to parallelize scans assuming scans are of reasonable size.

If the typical use case is point gets then hashing or inverting keys would be preferable.
As usual: It depends.
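
For illustration only (this is not code from the thread or from HBaseWD): a minimal Python sketch of salting with a small, fixed number of buckets, and of how a range scan fans out into one scan per bucket. It assumes the salt is derived from a hash of the key rather than from a round-robin counter. The names `NUM_BUCKETS`, `salted_key`, `bucket_ranges` and `range_scan` are hypothetical, and a sorted in-memory list stands in for the table.

```python
import hashlib
import heapq
from bisect import bisect_left

NUM_BUCKETS = 16  # hypothetical; a small, fixed number of salt values


def bucket_of(row_key: bytes) -> int:
    """Derive the salt from the key itself so it can be recomputed on reads."""
    return hashlib.md5(row_key).digest()[0] % NUM_BUCKETS


def salted_key(row_key: bytes) -> bytes:
    """Prefix the original key with a one-byte bucket id."""
    return bytes([bucket_of(row_key)]) + row_key


def bucket_ranges(start: bytes, stop: bytes):
    """One (start, stop) pair per bucket; each pair could be scanned in parallel."""
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(NUM_BUCKETS)]


def range_scan(sorted_keys, start: bytes, stop: bytes):
    """Simulate the fan-out over a sorted key list and merge the per-bucket results."""
    per_bucket = []
    for lo, hi in bucket_ranges(start, stop):
        i, j = bisect_left(sorted_keys, lo), bisect_left(sorted_keys, hi)
        per_bucket.append([k[1:] for k in sorted_keys[i:j]])  # strip the salt byte
    return list(heapq.merge(*per_bucket))


# Stand-in for a table: sequential date keys, stored under their salted form.
dates = [b"2012-12-%02d" % d for d in range(1, 31)]
table = sorted(salted_key(k) for k in dates)
result = range_scan(table, b"2012-12-10", b"2012-12-15")
```

Here `result` holds the original keys in `[start, stop)` in sorted order. The cost of the approach is visible in `bucket_ranges`: every range scan becomes `NUM_BUCKETS` smaller scans.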

-- Lars

From: Michael Segel <michael_segel@hotmail.com>
To: user@hbase.apache.org 
Sent: Tuesday, December 18, 2012 3:29 PM
Subject: Re: Is it necessary to set MD5 on rowkey?
And that's the point. A salt, as you explain it, is conceptually a number you add to the
key to ensure a better distribution, which means that you will have inefficiencies
in terms of both scans and gets.

Using a hash as the full key, or taking the hash, truncating it and appending the key to it,
may screw up scans, but your get() is intact.
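
A minimal Python sketch of the truncated-hash-plus-key variant, assuming MD5 and a one-byte prefix. The helper names `hashed_key`, `put` and `get` are hypothetical, and a dict stands in for the table. Because the prefix is a pure function of the key, a point lookup needs exactly one read.

```python
import hashlib


def hashed_key(row_key: bytes, prefix_len: int = 1) -> bytes:
    """Prefix the key with the first prefix_len bytes of its MD5 digest."""
    return hashlib.md5(row_key).digest()[:prefix_len] + row_key


store = {}  # stand-in for a table: stored key -> value


def put(row_key: bytes, value: str) -> None:
    store[hashed_key(row_key)] = value


def get(row_key: bytes):
    # The stored key can be recomputed from the row key alone, so one lookup is enough.
    return store.get(hashed_key(row_key))


put(b"user-000123", "alice")
found = get(b"user-000123")
missing = get(b"user-999999")
```

Keeping the original key after the prefix also means two records can only share a stored key if their original keys are identical.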

There are other options like inverting the numeric key ... 
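
One possible reading of "inverting", sketched in Python under the assumption that it means reversing the digits of a zero-padded numeric id so that the fastest-changing digit comes first. The names `inverted_key` and `original_id` and the width of 10 are hypothetical.

```python
def inverted_key(n: int, width: int = 10) -> str:
    """Zero-pad n to a fixed width and reverse the digits."""
    return str(n).zfill(width)[::-1]


def original_id(key: str) -> int:
    """Undo inverted_key: reverse the digits back and parse the number."""
    return int(key[::-1])


# Consecutive ids no longer share a long common prefix, so they sort far apart.
keys = [inverted_key(n) for n in (1000, 1001, 1002)]
```

As with a hash prefix, the stored key is computable from the id, so a get() stays a single lookup, while range scans over consecutive ids are lost.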

And of course doing nothing. 

Using a salt as part of the design pattern is bad. 

With respect to the OP, I was discussing the use of a hash and some alternative ways to implement
the hash of a key.
Again, doing nothing may also make sense, if you understand the risks and you know how
your data is going to be used.

On Dec 18, 2012, at 11:36 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:

> Mike,
> Please read the *full post* before judging. In particular, the "Hash-based
> distribution" section. You can find the same in HBaseWD small README file
> [1] (not sure if you read it at all before commenting on the lib). Round
> robin is mainly for explaining the concept/idea (though not only for that).
> Thank you,
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> [1] https://github.com/sematext/HBaseWD
> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>> Quick answer...
>> Look at the salt.
>> It's just a number from a round robin counter.
>> There is no tie between the salt and row.
>> So when you want to fetch a single row, how do you do it?
>> ...
>> ;-)
>> On Dec 18, 2012, at 11:12 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:
>>> Hello,
>>> @Mike:
>>> I'm the author of that post :).
>>> Quick reply to your last comment:
>>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>>> idea" in a more specific way than "Fetching data takes more effort"? It would be
>>> helpful for anyone who is looking into using this approach.
>>> 2) The approach described in the post also says you can prefix with the
>>> hash, you probably missed that.
>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata.
>>> Please re-read the question: the intention is to distribute the load while
>>> still being able to do "partial key scans". The blog post linked above
>>> explains one possible solution for that, while your answer doesn't.
>>> @bigdata:
>>> Basically, when it comes to solving two issues, distributing writes and
>>> being able to read data sequentially, you have to balance between being
>>> good at each of them. There is a very good presentation by Lars (see slide 22):
>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
>>> You will see how this is correlated. In short:
>>> * having an md5/other hash prefix of the key does better w.r.t. distributing
>>> writes, but compromises the ability to do range scans efficiently
>>> * having a very limited number of 'salt' prefixes still allows range
>>> scans (less efficiently than normal range scans, of course, but still good
>>> enough in many cases) while providing worse distribution of writes
>>> In the latter case, by choosing the number of possible 'salt' prefixes (which
>>> could be derived from hashed values, etc.) you can balance between
>>> the efficiency of distributing writes and the ability to run fast range scans.
>>> Hope this helps
>>> Alex Baranau
>>> ------
>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>>> Solr
>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>> Hi,
>>>> First, the use of a 'Salt' is a very, very bad idea and I would really
>>>> hope that the author of that blog takes it down.
>>>> While it may solve an initial problem in terms of region hot spotting, it
>>>> creates another problem when it comes to fetching data. Fetching data takes
>>>> more effort.
>>>> With respect to using a hash (MD5 or SHA-1), you are creating a more random
>>>> key that is unique to the record.  Some would argue that with MD5 or SHA-1
>>>> you could mathematically have a collision; however, you could then
>>>> append the key to the hash to guarantee uniqueness. You could also do
>>>> things like take the hash, truncate it to the first byte, and then
>>>> append the record key. This should give you enough randomness to avoid hot
>>>> spotting after the initial region completion and you could pre-split out
>>>> any number of regions. (First byte 0-255 for values, so you can program the
>>>> split...
>>>> Having said that... yes, you lose the ability to perform a sequential scan
>>>> of the data.  At least to a point.  It depends on your schema.
>>>> Note that you need to think about how you are primarily going to access
>>>> the data.  You can then determine the best way to store the data to gain
>>>> the best performance. For some applications... the region hot spotting
>>>> isn't an important issue.
>>>> Note YMMV
>>>> HTH
>>>> -Mike
>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <dhardy@viadeoteam.com> wrote:
>>>>> Hello,
>>>>> There is a middle ground between sequential keys (hot spotting risk) and md5
>>>>> (heavy scans):
>>>>> * you can use composite keys with a field that can segregate data
>>>>> (hostname, productname, metric name), like OpenTSDB
>>>>> * or use a salt with a limited number of values (for example,
>>>>> substr(md5(rowid),0,1) = 16 values)
>>>>>  so that a scan is a combination of 16 filters, one on each salt value
>>>>>  you can base your code on HBaseWD by Sematext:
>>>>>    http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>    https://github.com/sematext/HBaseWD
>>>>> Cheers,
>>>>> 2012/12/18 bigdata <bigdatabase@outlook.com>
>>>>>> Many articles tell me that using MD5 for the rowkey, or for part of it, is a good method
>>>>>> to balance the records stored in different parts. But if I want to search
>>>>>> for records with sequential rowkeys, such as when a date is the rowkey or part of it, I
>>>>>> cannot use a rowkey filter to scan a range of date values in one pass once the date is
>>>>>> hashed by MD5. How do I balance this issue?
>>>>>> Thanks.
>>>>> --
>>>>> Damien HARDY