harmony-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tim Ellison <t.p.elli...@gmail.com>
Subject Re: Performance improvement of java.util.HashMap (Was Re: [jira] Commented: (HARMONY-4064) [classlib][luni] Performance improvement of java.util.HashMap)
Date Wed, 27 Jun 2007 16:36:32 GMT
Jimmy,Jing Lv wrote:
>    As currently focus on performance, I find this JIRA(4064) and do
> some investigation on it.

Great.

>    The Performance improvement of java.util.HashMap is about two
> aspects: one is hash algorithm that make hashcode randomly distributed
> enough, just as Tim's fix/comments in Harmony-4064, the other is about
> to make hash algorithm itself perform fast enough. HashMap is so
> widely used so that both two aspects may be very important.

Just to be clear, you are talking about computing the hash 'bucket'
index for a given key's hashcode.

>    I strongly agree with Tim that hash algorithm that make hashcode
> randomly distributed is very important, for HashMap.put(), get(), etc,
> depends on it and it should perform better if the hash algorithm can
> make it randomly distributed.

Of course, you make a bet based upon the range of keys' hashcode values.

The more we know about the hashcode values the better we can design the
function to evenly distribute the keys across the available buckets.  We
used to perform badly for objects' identity hashcodes by dividing it by
2^n, which didn't give a good distribution.

> On the other hand, the hash algorithm should perform very fast as
> every HashMap.put and get method call this hash algorithm, if it is
> slow itself, then its benfits on randomly distribution became
> meaningless.

Agreed, a fast bucket index computation that gives an even distribution
is the ideal.

>    I do a lot of benchmark and micro-benchmarks,  it shows that (a &
> b) is much faster than (a % b).

where 'a' is the key's hashcode and 'b' is the number of buckets...

> And if b is Math.pow(2,n)-1, the two operations just get the same
> result. That's why if we choose "&(Math.pow(2,n)-1)" (here
> Math.pow(2,n) is the size of bucket of a hashmap)to replace "%", by
> this way we improve performance greatly at least 2% ~ 3% in some
> benchmarks . This shows if we avoid slow operation like mod(%),
> multiply(*), division(/) and so forth, we can get performance
> improvement.

Agreed, so if we limit the number of buckets to always be a power of
two, then you can use the fast bit operation for computing a bucket index.

So now you have the problem that identity hashcodes are often the
object's memory address at some point in time, and many implementations
align the objects on boundaries that won't give a good even distribution
when divided by 2^n.

> Yes, I do think to randomly distrubute the hashcode is very important
> and I suggest we can use other machnisms other than mod prime to make
> it happen.

For which set of hashcodes?  Those that are typically an object's
identity hashcode?  I guess that is the most popular key hashcode, but I
don't know for sure.

> One thought is right-shift the hashcode and combine bits together (by
> AND or OR or Exclusive Or operation) into low order(e.g, (a & (a >>
> 8) ^ (a >> 16) | (a >> 24)) & (bucketSize-1), a small test on it
> shows that it does distributed more randomly, but it may be better
> algorithms.

Why so complex bit shifting?  How did you settle on this?

If you are optimizing for the identity hashcode case then you can
probably ignore the bottom 'y' bits due to object alignment (and top 'z'
bits for address range limits), so just do (a >> 8) & (bucketSize - 1)
assuming the objects are 8-bit aligned.

> In this way the hashcode may be more random destributed and performs
> fast. And in this way we may keep a balance of a both fast and good
> hash algorithm.

An honorable goal!

Regards,
Tim


Mime
View raw message