hbase-user mailing list archives

From Mark <static.void....@gmail.com>
Subject Re: Region Splits
Date Mon, 21 Nov 2011 16:06:13 GMT
As far as row key length goes, should it be a concern that we are now
adding 16 bytes to each key? Would it be sufficient to take, say, the
first 4 bytes of the MD5 hash?
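
On the 4-byte question, a minimal sketch in Python (not HBase client code) may help. Because the full old_key is still appended, two different keys cannot map to the same new_key even if their prefixes match, so the prefix only has to spread writes. The names salted_key, split_points and PREFIX_LEN are hypothetical, and the evenly spaced split points assume the prefix bytes are roughly uniform, which is what one would expect from MD5.

```python
import hashlib
from typing import List

PREFIX_LEN = 4  # assumption: number of MD5 digest bytes kept as the prefix


def salted_key(old_key: bytes, prefix_len: int = PREFIX_LEN) -> bytes:
    """Return md5(old_key)[:prefix_len] + old_key."""
    if not 1 <= prefix_len <= 16:
        raise ValueError("an MD5 digest has 16 bytes")
    prefix = hashlib.md5(old_key).digest()[:prefix_len]
    return prefix + old_key


def split_points(num_regions: int, prefix_len: int = PREFIX_LEN) -> List[bytes]:
    """Evenly spaced boundaries over the prefix space, for pre-splitting
    a table into num_regions regions (num_regions - 1 boundaries)."""
    space = 256 ** prefix_len
    return [
        (i * space // num_regions).to_bytes(prefix_len, "big")
        for i in range(1, num_regions)
    ]


if __name__ == "__main__":
    for k in (b"row-000001", b"row-000002", b"row-000003"):
        print(salted_key(k).hex())
    print([p.hex() for p in split_points(4)])
```

With a 4-byte prefix each key grows by 4 bytes instead of 16. Point reads by old_key still work, as long as the reader recomputes the prefix the same way.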

On 11/21/11 7:55 AM, Mark wrote:
> Damn, I was hoping my understanding was flawed.
>
> In your example, I am guessing the old_key suffix is added to
> prevent any possible collision. Is that correct?
>
> On 11/20/11 9:39 PM, Nicolas Spiegelberg wrote:
>> Sequential writes are also an argument for pre-splitting and using hash
>> prefixing.  In other words, pre-split your table into N regions instead of
>> the default of 1 and transform your keys into:
>>
>> new_key = md5(old_key) + old_key
>>
>> Using this method, your sequential writes under the old_key are now spread
>> evenly across all regions.  There are some limitations to hash prefixing,
>> such as non-sequential scans across row boundaries.  However, it's a
>> tradeoff between even distribution and advanced query options.
>>
>> On 11/20/11 7:54 PM, "Amandeep Khurana" <amansk@gmail.com> wrote:
>>
>>> Mark,
>>>
>>> Yes, your understanding is correct. If your keys are sequential
>>> (timestamps, etc.), you will always be writing to the end of the table
>>> and "older" regions will not get any writes. This is one of the
>>> arguments against using sequential keys.
>>>
>>> -ak
>>>
>>> On Sun, Nov 20, 2011 at 11:33 AM, Mark <static.void.dev@gmail.com> wrote:
>>>
>>>> Say we have a use case that has sequential row keys and we have rows
>>>> 0-100. Let's assume that 100 rows = the split size. Now when there is a
>>>> split, it will split at the halfway mark, so there will be two regions
>>>> as follows:
>>>>
>>>> Region1 [START-49]
>>>> Region2 [50-END]
>>>>
>>>> So now at this point all inserts will be writing to Region2 only, correct?
>>>> Now at some point Region2 will need to split and it will look like the
>>>> following before the split:
>>>>
>>>> Region1 [START-49]
>>>> Region2 [50-150]
>>>>
>>>> After the split it will look like:
>>>>
>>>> Region1 [START-49]
>>>> Region2 [50-99]
>>>> Region3 [100-END]
>>>>
>>>> And this pattern will continue, correct? My question is: when there is a
>>>> use case that has sequential keys, how would any of the older regions
>>>> ever receive any more writes? It seems like they would always be stuck at
>>>> MaxRegionSize/2. Can someone please confirm or clarify this issue?
>>>>
>>>> Thanks
>>>>
>>>>
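
The pattern described in the original question can be reproduced with a toy simulation. This is a sketch under simplifying assumptions: regions split on row count rather than store file size, the split lands exactly at the midpoint, and simulate_sequential_writes is a hypothetical helper, not anything in HBase.

```python
from typing import List


def simulate_sequential_writes(num_rows: int, split_size: int) -> List[List[int]]:
    """Insert keys 0..num_rows-1 in order and split a region at its midpoint
    once it holds split_size rows. Returns [start_key, row_count] per region,
    ordered by start key."""
    regions = [[0, 0]]
    for _ in range(num_rows):
        last = regions[-1]  # a sequential key always lands in the last region
        last[1] += 1
        if last[1] >= split_size:
            half = last[1] // 2
            # The upper half becomes a new region; the lower half stays put
            # and never receives another write.
            regions.append([last[0] + half, last[1] - half])
            last[1] = half
    return regions


if __name__ == "__main__":
    for start, count in simulate_sequential_writes(320, 100):
        print(f"region starting at {start}: {count} rows")
```

With 320 rows and a split size of 100, every region except the last ends up holding 50 rows, i.e. half the split size, and only the last region keeps growing.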
