hbase-user mailing list archives

From Rob Roland <...@simplymeasured.com>
Subject RE: Use of MD5 as row keys - is this safe?
Date Fri, 20 Jul 2012 19:21:59 GMT
I use a SHA1 hash of an identifier as a rowkey and store the unhashed
version in a "metadata" column family. Makes for a good distribution of
keys and an easy thing to pre-split tables on.
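A minimal sketch of this scheme, for illustration only. It assumes raw 20-byte SHA-1 digests as row keys and a hypothetical "id" qualifier inside the "metadata" family; the helper names are made up here and no HBase client calls are shown. Because the digests are close to uniformly distributed, evenly spaced two-byte boundaries are one reasonable choice of pre-split points.

```python
import hashlib
from typing import Dict, List


def sha1_rowkey(identifier: str) -> bytes:
    """Return the raw 20-byte SHA-1 digest of the identifier."""
    return hashlib.sha1(identifier.encode("utf-8")).digest()


def build_row(identifier: str) -> Dict[str, bytes]:
    """Pair the hashed row key with the unhashed identifier.

    "metadata:id" is a hypothetical column name (family "metadata",
    qualifier "id"); only the family name comes from the message above.
    """
    return {
        "rowkey": sha1_rowkey(identifier),
        "metadata:id": identifier.encode("utf-8"),
    }


def split_points(num_regions: int) -> List[bytes]:
    """Return num_regions - 1 evenly spaced two-byte split keys.

    The boundaries divide the 16-bit prefix space 0x0000..0xFFFF into
    num_regions equal ranges.
    """
    return [
        (i * 65536 // num_regions).to_bytes(2, "big")
        for i in range(1, num_regions)
    ]
```

The split keys could then be passed to whatever table-creation call the client offers, and the original identifier can be read back from the metadata column when a row is fetched by scan rather than by key.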
From: Michel Segel
Sent: 7/20/2012 12:16 PM
To: user@hbase.apache.org
Cc: user@hbase.apache.org
Subject: Re: Use of MD5 as row keys - is this safe?
I don't believe there have been any reports of collisions, but if
you are concerned, you could use SHA-1 to generate the hash.
Relatively speaking, SHA-1 is slower, but still fast enough for most
applications.

I don't know how its speed compares to an MD5 plus a string concatenation,
but it should yield a smaller key.
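The key-size claim can be checked with a short sketch. It assumes raw binary digests (16 bytes for MD5, 20 for SHA-1) and a made-up sample identifier; the function names are illustrative only.

```python
import hashlib


def md5_plus_id_key(identifier: bytes) -> bytes:
    """Row key of the form MD5(id) + id: 16 digest bytes, then the id."""
    return hashlib.md5(identifier).digest() + identifier


def sha1_only_key(identifier: bytes) -> bytes:
    """Row key that is just the 20-byte SHA-1 digest of the id."""
    return hashlib.sha1(identifier).digest()


sample_id = b"user-12345"  # hypothetical 10-byte identifier
print(len(md5_plus_id_key(sample_id)))  # 16 + 10 = 26 bytes
print(len(sha1_only_key(sample_id)))    # always 20 bytes
```

Under these assumptions the SHA-1 key is shorter whenever the identifier is longer than 4 bytes. The trade-off is that the SHA-1-only key no longer contains the identifier, so uniqueness rests on the hash rather than being guaranteed.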

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 20, 2012, at 11:31 AM, Damien Hardy <dhardy@figarocms.fr> wrote:

> On 20/07/2012 18:22, Jonathan Bishop wrote:
>> Hi,
>>
>> I know it is commonly suggested to use an MD5 checksum to create a row
>> key from some other identifier, such as a string or long. This is usually
>> done to guard against hot-spotting and seems to work well.
>>
>> My concern is that there is no guard against collision when this is done - two
>> different strings or longs could produce the same row-key. Although this is
>> very unlikely, it is bothersome to consider this possibility for large
>> systems.
>>
>> So what I usually do is concatenate the MD5 with the original identifier...
>>
>> MD5(id) + id
>>
>> which ensures that the rowkey is both randomly distributed and unique.
>>
>> Is this necessary, or is it the common practice to just use the MD5
>> checksum itself?
>>
>> Thanks,
>>
>> Jon
>
> Hello Jonathan,
>
> md5(id)+id is a good way to avoid hotspotting and ensure uniqueness.
>
> md5(id)[0]+id could be another way: it limits the randomness of the rowid
> to 16 values.
> You can then combine (with OR logic) 16 filters in a scanner (one for each
> character possible in an md5 hex digest).
> It also limits the balancing to 16 potential regions.
>
> Cheers,
>
> --
> Damien
>
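The two key layouts discussed above can be sketched as follows. This assumes hex-encoded MD5 digests and string row keys; the function names are hypothetical, and the 16 prefixes stand in for the 16 scanner filters Damien describes rather than for any particular HBase filter class.

```python
import hashlib
from typing import List

HEX_DIGITS = "0123456789abcdef"


def md5_hex(identifier: str) -> str:
    """Hex MD5 digest of the identifier (32 characters)."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()


def full_hash_key(identifier: str) -> str:
    """md5(id) + id: fully randomized prefix, unique because the id is appended."""
    return md5_hex(identifier) + identifier


def bucketed_key(identifier: str) -> str:
    """md5(id)[0] + id: one hex character of salt, so only 16 buckets."""
    return md5_hex(identifier)[0] + identifier


def scan_prefixes(id_prefix: str) -> List[str]:
    """Row-key prefixes to scan to find every id starting with id_prefix.

    One prefix per bucket; a scanner would OR these 16 prefix filters together.
    """
    return [digit + id_prefix for digit in HEX_DIGITS]
```

With the one-character salt, ids sharing a prefix stay adjacent within each bucket, so range scans remain possible at the cost of 16 scans or filters. The table can be spread over at most 16 key ranges by salt alone.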
