hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Haidinyak <phaidin...@local.com>
Subject RE: OT - Hash Code Creation
Date Thu, 17 Mar 2011 16:44:25 GMT
Hash Code in Object is limited to an int and a quick look at HashMap and Trove's HashMap looks
like they are only using 31 bits of that. I am now trying a modified version of what Ted pointed
at and it seems to be working very well. I modified the original since only the last few bytes
in the key are usually unique so I start at both ends when creating the Hash Code. So far
I am half way through an import and the times went down from 24 seconds to 11 seconds on the
first few files and have been holding around 13 seconds at half way vs 45 seconds with the
old method.



-----Original Message-----
From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris Tarnas
Sent: Thursday, March 17, 2011 9:21 AM
To: user@hbase.apache.org
Subject: Re: OT - Hash Code Creation

With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision,
the rule of thumb is with 64bit hash key expect a collision when you reach about 2^32 elements
in your set. I half of a 128bit MD5 sum (a cryptographic hash so you can only use parts of
it if you want) as that is readily available in our systems and so far has not been a bottleneck.
I believe there is now a 64bit murmurhash, that would be faster to compute and ideal for what
you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable
while still being fairly dense, IIRC a 64bit key should be only 11 characters.


On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote:

> Thanks, I'll give that a try.
> -Pete
> On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <tdunning@maprtech.com> wrote:
>> Double hashing is a find thing.  To actually answer the question, though, I
>> would recommend Murmurhash or JOAAT (
>> http://en.wikipedia.org/wiki/Jenkins_hash_function)
>> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <octo47@gmail.com> wrote:
>>> Try hash table with double hashing.
>>> Something like this
>>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
>>> 2011/3/17 Peter Haidinyak <phaidinyak@local.com>
>>> > Hi,
>>> >        This is a little off topic but this group seems pretty swift so I
>>> > thought I would ask. I am aggregating a day's worth of log data which
>>> means
>>> > I have a Map of over 24 million elements. What would be a good algorithm
>>> to
>>> > use for generating Hash Codes for these elements that cut down on
>>> > collisions? I application starts out reading in a log (144 logs in all)
>>> in
>>> > about 20 seconds and by the time I reach the last log it is taking around
>>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
>>> > I've played around with different Hashing algorithms and cut the original
>>> > time from over 300 seconds to 120 but I know I can do better.
>>> > The key I am using for the Map is an alpha-numeric string that is
>>> > approximately 16 character long with the last 4 or 5 character being the
>>> > most unique.
>>> >
>>> > Any ideas?
>>> >
>>> > Thanks
>>> >
>>> > -Pete
>>> >

View raw message