hbase-user mailing list archives

From Damien Hardy <dha...@viadeoteam.com>
Subject Re: Is it necessary to set MD5 on rowkey?
Date Tue, 17 Dec 2013 10:19:38 GMT
Using a custom InputFormat with a dedicated getSplits() allows you to use
a single Scan object when initiating the job. It is cloned later by each
mapper, which sets startRow and stopRow according to the list returned by
getSplits().

getSplits() would return a list of (startRow, stopRow) pairs calculated
from the regions and the salt. This is actually what HBaseWD does: it
runs the getSplits() from TableInputFormat once for each salt value.

The size of the list returned by getSplits() determines the number of mappers.
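
For illustration, a minimal sketch of such an input format (an assumption
of how it could look, not the actual HBaseWD code; the class name and the
1-byte salt scheme are made up, and it assumes the job's Scan has explicit,
salt-free start/stop rows):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

// Hypothetical input format: re-runs the stock split computation once
// per salt value.
public class SaltedTableInputFormat extends TableInputFormat {

  private static final int SALT_BUCKETS = 16;

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    Scan original = getScan();
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int salt = 0; salt < SALT_BUCKETS; salt++) {
      // Clone the scan, restricted to this salt's (startRow, stopRow) range.
      Scan salted = new Scan(original);
      salted.setStartRow(Bytes.add(new byte[] { (byte) salt }, original.getStartRow()));
      salted.setStopRow(Bytes.add(new byte[] { (byte) salt }, original.getStopRow()));
      setScan(salted);
      // TableInputFormat aligns each salted range on region boundaries.
      splits.addAll(super.getSplits(context));
    }
    setScan(original);
    // One mapper per element of this list.
    return splits;
  }
}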

Best regards,

-- 
Damien

On 17/12/2013 10:57, bigdata wrote:
> Thanks for your reply, Damien.
> So this solution still uses one Scan object, and sends it to initTableMapperJob?
> Does the modified getSplits() function set the number of mappers from the
> salt bucket number? If I set 256 salt buckets, the mapper number will be
> 256, right?
> 
> Another question: can this bucket number be changed later? Like 16 at
> first, then when the data becomes large, I expand it to 32 or 64?
> Thanks.
> 
>> Date: Tue, 17 Dec 2013 10:21:13 +0100
>> From: dhardy@viadeoteam.com
>> To: user@hbase.apache.org
>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>
>> Hello,
>>
>> Yes, you need 256 scan ranges, or one (almost) full scan with a
>> combination of filters, one for each of the 256 ranges
>> (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.Operator.html#MUST_PASS_ONE)
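>>
>> For example, building the 256 scan ranges for a MapReduce job (a
>> sketch; the table name, dates, mapper class and job are placeholders,
>> and it needs an HBase version whose initTableMapperJob accepts a list
>> of scans):
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> // One Scan per salt value, each covering the same date range.
>> List<Scan> scans = new ArrayList<Scan>();
>> for (int salt = 0; salt < 256; salt++) {
>>   Scan scan = new Scan();
>>   scan.setStartRow(Bytes.add(new byte[] { (byte) salt }, Bytes.toBytes("20131201")));
>>   scan.setStopRow(Bytes.add(new byte[] { (byte) salt }, Bytes.toBytes("20131217")));
>>   // The multi-scan variant reads the table name from each Scan.
>>   scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, Bytes.toBytes("mytable"));
>>   scans.add(scan);
>> }
>> TableMapReduceUtil.initTableMapperJob(scans, MyMapper.class,
>>     ImmutableBytesWritable.class, Result.class, job);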
>>
>> For mapreduce, the getSplits() method inherited from
>> TableInputFormatBase should be modified to handle the salt values.
>> This is what is done in
>> https://github.com/sematext/HBaseWD/blob/master/src/main/java/com/sematext/hbase/wd/WdTableInputFormat.java
>> (to return to the HBaseWD example)
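>>
>> Roughly, per the HBaseWD README (quoting from memory, so check the
>> project for the exact API; keyDistributor is the same distributor
>> object used when writing, and MyMapper/"mytable"/job are placeholders):
>>
>> Scan scan = new Scan(); // single scan, salt-free start/stop rows
>> TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
>>     ImmutableBytesWritable.class, Result.class, job);
>> // Substitute the standard TableInputFormat set by initTableMapperJob.
>> job.setInputFormatClass(WdTableInputFormat.class);
>> keyDistributor.addInfo(job.getConfiguration());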
>>
>> So a mapper (several, if a salt value covers many regions) is dedicated
>> to each salt value, just like the plain TableInputFormat would do
>> without salt.
>>
>> Best regards.
>>
>> -- 
>> Damien
>>
>>
>> On 17/12/2013 09:36, bigdata wrote:
>>> Hello,
>>> @Alex Baranau
>>> Thanks for your salt solution. In my understanding, the salt solution
>>> divides the data into several parts (if 2 hex letters, 00~FF, then 256
>>> parts). My question is: when I want to scan data, do I need to scan 256
>>> times for the following situation: rowkey = salt prefix (00~FF) + date + xxx
>>> And if I want to do mapreduce on this table, is
>>> initTableMapperJob(List<Scan>, ...) OK?
>>> An example of scanning the salted table would be appreciated!
>>> Thanks.
>>>
>>>> Date: Tue, 18 Dec 2012 12:12:37 -0500
>>>> Subject: Re: Is it necessary to set MD5 on rowkey?
>>>> From: alex.baranov.v@gmail.com
>>>> To: user@hbase.apache.org
>>>>
>>>> Hello,
>>>>
>>>> @Mike:
>>>>
>>>> I'm the author of that post :).
>>>>
>>>> Quick reply to your last comment:
>>>>
>>>> 1) Could you please describe why "the use of a 'Salt' is a very, very
>>>> bad idea" in a more specific way than "Fetching data takes more
>>>> effort"? That would be helpful for anyone who is looking into using
>>>> this approach.
>>>>
>>>> 2) The approach described in the post also says you can prefix with the
>>>> hash, you probably missed that.
>>>>
>>>> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
>>>> Please re-read the question: the intention is to distribute the load while
>>>> still being able to do "partial key scans". The blog post linked above
>>>> explains one possible solution for that, while your answer doesn't.
>>>>
>>>> @bigdata:
>>>>
>>>> Basically, when it comes to solving the two issues, distributing
>>>> writes and having the ability to read data sequentially, you have to
>>>> balance how good you are at each of them. Very good presentation by Lars:
>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
>>>> slide 22. You will see how this is correlated. In short:
>>>> * having an md5/other hash prefix of the key does better w.r.t.
>>>> distributing writes, while compromising the ability to do range scans
>>>> efficiently
>>>> * having a very limited number of 'salt' prefixes still allows range
>>>> scans (less efficiently than normal range scans, of course, but still
>>>> good enough in many cases) while providing worse distribution of writes
>>>>
>>>> In the latter case, by choosing the number of possible 'salt' prefixes
>>>> (which could be derived from hashed values, etc.) you can balance
>>>> between write-distribution efficiency and the ability to run fast
>>>> range scans.
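>>>>
>>>> To make the two options concrete, a small sketch (commons-codec
>>>> supplies the md5; recordKey stands for the original key bytes):
>>>>
>>>> import org.apache.commons.codec.digest.DigestUtils;
>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>
>>>> // Full-hash prefix: best write distribution, no range scans.
>>>> byte[] hashedKey = Bytes.add(DigestUtils.md5(recordKey), recordKey);
>>>>
>>>> // Limited salt derived from the hash: 16 buckets, so a range scan
>>>> // becomes 16 per-bucket scans, while writes spread over 16 "streams".
>>>> byte salt = (byte) (DigestUtils.md5(recordKey)[0] & 0x0f);
>>>> byte[] saltedKey = Bytes.add(new byte[] { salt }, recordKey);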
>>>>
>>>> Hope this helps
>>>>
>>>> Alex Baranau
>>>> ------
>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>>>> Solr
>>>>
>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel <michael_segel@hotmail.com> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> First, the use of a 'Salt' is a very, very bad idea and I would really
>>>>> hope that the author of that blog take it down.
>>>>> While it may solve an initial problem in terms of region hotspotting,
>>>>> it creates another problem when it comes to fetching data. Fetching
>>>>> data takes more effort.
>>>>>
>>>>> With respect to using a hash (MD5 or SHA-1), you are creating a more
>>>>> random key that is unique to the record. Some would argue that with
>>>>> MD5 or SHA-1 you could mathematically have a collision, however you
>>>>> could then append the key to the hash to guarantee uniqueness. You
>>>>> could also do things like take the hash, truncate it to the first
>>>>> byte, and then append the record key. This should give you enough
>>>>> randomness to avoid hotspotting after the initial region completion,
>>>>> and you could pre-split out any number of regions. (First byte: 0-255
>>>>> for values, so you can program the split...)
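>>>>>
>>>>> A quick sketch of that truncated-hash key (commons-codec supplies
>>>>> the MD5; recordKey stands for the natural key bytes):
>>>>>
>>>>> import org.apache.commons.codec.digest.DigestUtils;
>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>
>>>>> // First byte of MD5(key), then the full key: random enough to avoid
>>>>> // hotspotting, still unique, and the table can be pre-split on the
>>>>> // 256 possible values of the leading byte.
>>>>> byte[] rowkey = Bytes.add(
>>>>>     new byte[] { DigestUtils.md5(recordKey)[0] }, recordKey);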
>>>>>
>>>>>
>>>>> Having said that... yes, you lose the ability to perform a sequential
>>>>> scan of the data. At least to a point. It depends on your schema.
>>>>>
>>>>> Note that you need to think about how you are primarily going to access
>>>>> the data.  You can then determine the best way to store the data to gain
>>>>> the best performance. For some applications... the region hotspotting
>>>>> isn't an important issue.
>>>>>
>>>>> Note YMMV
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <dhardy@viadeoteam.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> There is a middle ground between sequential keys (hotspotting risk)
>>>>>> and md5 (heavy scans):
>>>>>>  * you can use composed keys with a leading field that segregates
>>>>>> data (hostname, product name, metric name) like OpenTSDB does
>>>>>>  * or use a salt with a limited number of values (for example
>>>>>> substr(md5(rowid),0,1) = 16 values)
>>>>>>    so that a scan is a combination of 16 filters, one on each salt
>>>>>> value (see the sketch after the links below)
>>>>>>    you can base your code on HBaseWD by Sematext
>>>>>>
>>>>>>
>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>> https://github.com/sematext/HBaseWD
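>>>>>>
>>>>>> A minimal client-side sketch doing one range scan per salt value
>>>>>> (equivalent to the 16-filter combination; plain HBase API, not the
>>>>>> HBaseWD code; conf, the table name and the dates are placeholders):
>>>>>>
>>>>>> import java.util.ArrayList;
>>>>>> import java.util.List;
>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>> import org.apache.hadoop.hbase.client.Result;
>>>>>> import org.apache.hadoop.hbase.client.ResultScanner;
>>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>>
>>>>>> HTable table = new HTable(conf, "mytable");
>>>>>> List<Result> merged = new ArrayList<Result>();
>>>>>> // One scan per hex salt value, same date range each time.
>>>>>> for (char salt : "0123456789abcdef".toCharArray()) {
>>>>>>   Scan scan = new Scan();
>>>>>>   scan.setStartRow(Bytes.toBytes(salt + "20121201"));
>>>>>>   scan.setStopRow(Bytes.toBytes(salt + "20121218"));
>>>>>>   ResultScanner rs = table.getScanner(scan);
>>>>>>   for (Result r : rs) {
>>>>>>     merged.add(r); // rows come back sorted within one bucket only
>>>>>>   }
>>>>>>   rs.close();
>>>>>> }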
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>>
>>>>>> 2012/12/18 bigdata <bigdatabase@outlook.com>
>>>>>>
>>>>>>> Many articles tell me that an MD5 rowkey, or part of it, is a good
>>>>>>> method to balance the records stored across different parts. But if
>>>>>>> I want to search some sequential rowkey records, such as a date as
>>>>>>> the rowkey or part of it, I can not use a rowkey filter to scan a
>>>>>>> range of date values in one pass once the date is MD5'd. How to
>>>>>>> balance this issue?
>>>>>>> Thanks.
>>
>
> 

-- 
Damien HARDY
IT Infrastructure Architect
Viadeo - 30 rue de la Victoire - 75009 Paris - France
PGP : 45D7F89A

