harmony-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jimmy,Jing Lv" <firep...@gmail.com>
Subject Re: Performance improvement of java.util.HashMap (Was Re: [jira] Commented: (HARMONY-4064) [classlib][luni] Performance improvement of java.util.HashMap)
Date Thu, 28 Jun 2007 03:46:18 GMT
2007/6/28, Tim Ellison <t.p.ellison@gmail.com>:
> Jimmy,Jing Lv wrote:
> >    As currently focus on performance, I find this JIRA(4064) and do
> > some investigation on it.
>
> Great.
>
> >    The Performance improvement of java.util.HashMap is about two
> > aspects: one is hash algorithm that make hashcode randomly distributed
> > enough, just as Tim's fix/comments in Harmony-4064, the other is about
> > to make hash algorithm itself perform fast enough. HashMap is so
> > widely used so that both two aspects may be very important.
>
> Just to be clear, you are talking about computing the hash 'bucket'
> index for a given key's hashcode.
>
> >    I strongly agree with Tim that hash algorithm that make hashcode
> > randomly distributed is very important, for HashMap.put(), get(), etc,
> > depends on it and it should perform better if the hash algorithm can
> > make it randomly distributed.
>
> Of course, you make a bet based upon the range of keys' hashcode values.
>
> The more we know about the hashcode values the better we can design the
> function to evenly distribute the keys across the available buckets.  We
> used to perform badly for objects' identity hashcodes by dividing it by
> 2^n, which didn't give a good distribution.
>
> > On the other hand, the hash algorithm should perform very fast as
> > every HashMap.put and get method call this hash algorithm, if it is
> > slow itself, then its benfits on randomly distribution became
> > meaningless.
>
> Agreed, a fast bucket index computation that gives an even distribution
> is the ideal.
>
> >    I do a lot of benchmark and micro-benchmarks,  it shows that (a &
> > b) is much faster than (a % b).
>
> where 'a' is the key's hashcode and 'b' is the number of buckets...
>
> > And if b is Math.pow(2,n)-1, the two operations just get the same
> > result. That's why if we choose "&(Math.pow(2,n)-1)" (here
> > Math.pow(2,n) is the size of bucket of a hashmap)to replace "%", by
> > this way we improve performance greatly at least 2% ~ 3% in some
> > benchmarks . This shows if we avoid slow operation like mod(%),
> > multiply(*), division(/) and so forth, we can get performance
> > improvement.
>
> Agreed, so if we limit the number of buckets to always be a power of
> two, then you can use the fast bit operation for computing a bucket index.
>
> So now you have the problem that identity hashcodes are often the
> object's memory address at some point in time, and many implementations
> align the objects on boundaries that won't give a good even distribution
> when divided by 2^n.
>
> > Yes, I do think to randomly distrubute the hashcode is very important
> > and I suggest we can use other machnisms other than mod prime to make
> > it happen.
>
> For which set of hashcodes?  Those that are typically an object's
> identity hashcode?  I guess that is the most popular key hashcode, but I
> don't know for sure.
>
> > One thought is right-shift the hashcode and combine bits together (by
> > AND or OR or Exclusive Or operation) into low order(e.g, (a & (a >>
> > 8) ^ (a >> 16) | (a >> 24)) & (bucketSize-1), a small test on it
> > shows that it does distributed more randomly, but it may be better
> > algorithms.
>
> Why so complex bit shifting?  How did you settle on this?
>
> If you are optimizing for the identity hashcode case then you can
> probably ignore the bottom 'y' bits due to object alignment (and top 'z'
> bits for address range limits), so just do (a >> 8) & (bucketSize - 1)
> assuming the objects are 8-bit aligned.
>

It sounds interesting and reasonable :)
 But I'm afraid not all hashcode() returns the object's memory
address, in some other cases, the difference of hashcodes may remain
in buttom 'y' bits or top 'z' bits, at that time our HashMap will
perform very slow.
So the best way is to use all 32 bit of a hashcode, and take extra
operator '|' and '^' does not cost much time.( though I'm not sure if
the the right-shift paramter '8' is proper)

> > In this way the hashcode may be more random destributed and performs
> > fast. And in this way we may keep a balance of a both fast and good
> > hash algorithm.
>
> An honorable goal!
>
> Regards,
> Tim
>
>


-- 

Best Regards!

Jimmy, Jing Lv
China Software Development Lab, IBM

Mime
View raw message