hbase-user mailing list archives

From lars hofhansl <lhofha...@yahoo.com>
Subject Re: Regarding rowkey
Date Wed, 12 Sep 2012 19:03:41 GMT
I attempted to write this up here: http://hadoop-hbase.blogspot.com/2011/12/introduction-to-hbase.html



----- Original Message -----
From: Ramasubramanian <ramasubramanian.narayanan@gmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Cc: Michael Segel <michael_segel@hotmail.com>; "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Wednesday, September 12, 2012 11:43 AM
Subject: Re: Regarding rowkey

Hi All,

Can someone please explain, in layman's terms, what a rowkey is and how to derive the rowkey (in the case of a hash map) to load data into HBase faster?

Regards,
Rams

On 12-Sep-2012, at 10:40 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:

> Not insisting :)
> MD5 and SHA-1 would be reasonable and can be used to replace the key as you say.
> 
> 
> 
> ----- Original Message -----
> From: Michael Segel <michael_segel@hotmail.com>
> To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
> Cc: 
> Sent: Wednesday, September 12, 2012 9:49 AM
> Subject: Re: Regarding rowkey
> 
> MD5 should work. SHA-1 may theoretically have collisions, but none has been found.
> Then there's SHA-2...
> 
> I don't disagree with your assertion, however... it causes the key to be longer than it needs to be.
> 
> If you insist on doing this... then take the MD5 hash, truncate it to 4 bytes, and prepend it to your key.
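Mike's truncated-MD5 salt can be sketched in a few lines of Python (the helper name and the 4-byte width are just illustrations of the idea; real writes would go through the HBase client):

```python
import hashlib

def salt_key(row_key: bytes) -> bytes:
    """Prepend the first 4 bytes of the key's MD5 digest to the key."""
    salt = hashlib.md5(row_key).digest()[:4]
    return salt + row_key
```

Because the salt is derived from the key itself, a reader holding the original key can recompute the full salted key deterministically.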
> 
> Just saying.
> 
> -Mike
> 
> On Sep 12, 2012, at 10:25 AM, lars hofhansl <lhofhansl@yahoo.com> wrote:
> 
>> If you use a collision-free hashing algorithm, you're right. Otherwise you'd have KVs suddenly grouped into rows that weren't part of the same row.
>> 
>> 
>> With hash prefixing you can use a fast and simple hashing algorithm, because you do not need the hash to be unique.
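A sketch of that idea, using CRC32 as the cheap hash and a single bucket byte as the prefix (the bucket count is an arbitrary choice for illustration):

```python
import zlib

NUM_BUCKETS = 16  # illustrative; roughly one bucket per expected region

def bucketed_key(row_key: bytes) -> bytes:
    # A fast, non-cryptographic hash is fine here: the prefix only needs
    # to spread writes across regions, not to be unique per key.
    bucket = zlib.crc32(row_key) % NUM_BUCKETS
    return bytes([bucket]) + row_key
```

Since collisions in the bucket byte are harmless, the original key after the prefix still distinguishes rows.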
>> 
>> Depends again on various aspects.
>> 
>> 
>> 
>> ----- Original Message -----
>> From: Michael Segel <michael_segel@hotmail.com>
>> To: user@hbase.apache.org; lars hofhansl <lhofhansl@yahoo.com>
>> Cc: 
>> Sent: Wednesday, September 12, 2012 5:46 AM
>> Subject: Re: Regarding rowkey
>> 
>> I wouldn't 'prefix' the hash to the key, but actually replace the key with a hash and store the unhashed key in a column.
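A minimal sketch of that layout, with a plain dict standing in for the Put (the column family and qualifier names are made up for illustration):

```python
import hashlib

def build_put(row_key: bytes, payload: bytes) -> dict:
    # The MD5 digest becomes the actual HBase row key; the original
    # (unhashed) key is preserved as an ordinary cell so it can still
    # be read back alongside the data.
    return {
        "row": hashlib.md5(row_key).digest(),
        "cells": {
            b"d:orig_key": row_key,   # hypothetical family:qualifier
            b"d:payload": payload,
        },
    }
```

Reads by original key recompute the digest; range scans over the original key order are lost, which is the trade-off being discussed.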
>> 
>> But that's a different discussion. 
>> 
>> In a nutshell, the problem is that there are a lot of potential use cases where you want to store data in a sequence-dependent fashion. So you will get a continual hotspot and half-full regions.
>> 
>> Assuming that the underlying data is much larger than the key, it may be better to hash the row key and then, using coprocessors, create a secondary sequential index of the initial key.
>> 
>> The advantages are that you will have far more rows within the secondary index table before a split occurs, and that there may be ways of controlling the writes to the index such that it may have less of an impact on the overall performance. (I don't know, I haven't had time to play with this idea... yet)
>> 
>> There are other options, at least in theory... and these would also be use case specific.
>> 
>> Just remember TANSTAAFL* applies. 
>> 
>> -Mike
>> 
>> * There Ain't No Such Thing As A Free Lunch - Robert Heinlein.
>> 
>> On Sep 11, 2012, at 10:08 PM, lars hofhansl <lhofhansl@yahoo.com> wrote:
>> 
>>> It depends. If you do not need to perform range scans along (prefixes of) your row keys, you can prefix the row key with a hash of the row key.
>>> That will give you a more or less random distribution of the keys, and hence you will not hit the same region server over and over.
>>> 
>>> You'll probably also want to presplit your table then.
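Presplitting pairs naturally with hash prefixes: because digests are roughly uniformly distributed, the split points can be evenly spaced byte prefixes. A sketch computing such split keys (the region count is an arbitrary example; the resulting boundaries would be passed to the HBase create-table call):

```python
def split_keys(num_regions: int) -> list:
    # Evenly spaced one-byte prefixes marking the boundaries between
    # pre-created regions, e.g. 16 regions -> 15 boundaries.
    step = 256 // num_regions
    return [bytes([i * step]) for i in range(1, num_regions)]
```

With the table presplit this way, hashed writes fan out across all regions from the first insert instead of filling one region and waiting for splits.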
>>> 
>>> -- Lars
>>> 
>>> 
>>> 
>>> ----- Original Message -----
>>> From: Ramasubramanian <ramasubramanian.narayanan@gmail.com>
>>> To: user@hbase.apache.org
>>> Cc: 
>>> Sent: Tuesday, September 11, 2012 10:39 AM
>>> Subject: Regarding rowkey
>>> 
>>> Hi,
>>> 
>>> What can be used as a rowkey to improve performance while loading into HBase? Currently I am using a sequence. It takes some 11-odd minutes to load 1 million records with 147 columns.
>>> 
>>> Regards,
>>> Rams 
>>> 
>> 

