mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: [collections] and what about 'identity'?
Date Fri, 02 Apr 2010 14:06:55 GMT
> What's the use case for needing to vary the hash function? It's one of
> those things where I assume there are incorrect ways to do it, and
> correct ways, and among the correct ways fairly clear arguments about
> which function will be better -- i.e. the object should provide the
> best function.

Unfortunately this is not true -- just recently I've hit a use case
where the keys stored were Long values and their distribution had a
very low variance in the lower bits. HPPC implemented open hashing
using 2^n arrays and hashes were modulo bitmask... this caused really,
really long conflict chains for values that were actually very
different. I looked at how JDK's HashMap solves this problem -- they
do a simple rehashing scheme internally (so it's object hash and then
remixing hash in a cascade). I've finally decided to allow external
hash functions AND changed the _default_ hash function used for
"remixing" to be murmur hash. Performance benchmarks show this yields
virtually no degradation in execution time (the CPUs seem to spend
most of their time waiting on cache misses anyway, so internal
rehashing is not an issue).

I must also apologize for a bit of inactivity with HPPC... Like I
said, we have released it internally on our "labs" Web site here:

http://labs.carrotsearch.com/hppc.html

It doesn't mean we turn our backs on contributing HPPC to Mahout --
the opposite, we would love to do it. But contrary to what I
originally thought (to push HPPC to Mahout as soon as possible) I kind
of grew reluctant because so many things are missing (equals/hashcode,
java collections adapters) or can be improved (documentation, faster
iterators).

So... I'm still going to experiment with HPPC in our labs, especially
API-wise, release one or two versions in between and then kindly ask
you to peek at the final (?) result and consider moving the code under
Mahout umbrella. Sounds good?

Dawid

Mime
View raw message