2007/7/4, Tim Ellison <t.p.ellison@gmail.com>:
> Jimmy,Jing Lv wrote:
> > I've done some further study on hashcode and HashMap. I find it
> > very interesting that the RI may have done something to identityHashCode,
> > just as Sergey said. According to Tim's microbenchmark on
> > JIRA-4046 [1], Harmony classlib + IBM VME uses only 4 buckets,
> > each containing 250 elements, while the RI uses up all 16 buckets,
> > each containing 62-64 elements. What's more, if we print the RI's
> > objects' hashcodes, they end in all kinds of digits (e.g.
> > 5b8827, 147c1db ...), while Harmony + IBM VME returns the exact
> > address (not sure), which ends in even digits (e.g.
> > 77387738, 77787778 ...). I haven't tried DRLVM yet (is it just
> > like IBM VME?)
>
> I'd say that the IBM VME should do a better job!
>
> > So if we make a small modification to the bucket index calculation in
> > Tim's test, e.g.,
> > int bucket = hashCode >> 1 & (numBuckets - 1);
> > all buckets are used, though some of them get many more
> > elements than others. And if we modify it as:
> > int bucket = hashCode >> 2 & (numBuckets - 1);
> > it behaves just as the RI does: every bucket gets nearly the same
> > number of elements.
> > If we apply this (hashCode >> 2 & (numBuckets - 1)) to the HashMap
> > bucket index algorithm, and run a microbenchmark like put/get/remove
> > 10000 times [2], it improves by 7% (~28s to ~26s on my desktop).
>
> That's cool (and of course you could rescale the divisor instead of
> shifting the hashCode each time and get the same effect).
>
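The shift experiment above can be sketched in a few lines. This is a hypothetical standalone sketch, not the actual classlib code: the 4-byte-aligned "addresses" simulate an identityHashCode derived from object addresses, which is only a guess at what the VM does.

```java
// Hypothetical sketch of the shift experiment: simulate identity hashcodes
// that look like 4-byte-aligned addresses (low 2 bits always 0) and count
// how many of 16 buckets each shift variant actually touches.
public class BucketSpreadDemo {
    static int bucket(int hashCode, int shift, int numBuckets) {
        return (hashCode >>> shift) & (numBuckets - 1);
    }

    public static void main(String[] args) {
        int numBuckets = 16;
        for (int shift = 0; shift <= 2; shift++) {
            boolean[] used = new boolean[numBuckets];
            for (int i = 0; i < 1000; i++) {
                int addressLike = 0x1000 + 4 * i; // 4-byte-aligned "address"
                used[bucket(addressLike, shift, numBuckets)] = true;
            }
            int count = 0;
            for (boolean u : used) {
                if (u) count++;
            }
            // shift 0 hits only 4 buckets; shift 2 spreads over all 16
            System.out.println("shift " + shift + ": " + count + " buckets used");
        }
    }
}
```

With these aligned inputs, shift 0 reaches 4 buckets, shift 1 reaches 8, and shift 2 reaches all 16, matching the distributions described above.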
> > But it is probably not a good idea for HashMap, since some
> > user-defined hashcodes will work very badly with this algorithm. Some
> > very famous benchmarks, like Specjbb, immediately lose 10~15% in score.
> > It may not be a proper algorithm for the classlib bucket index
> > calculation.
>
> Agreed, and with apologies for sounding like a broken record, IMHO the
> classlib code should not optimize for an inadequate distribution of an
> undefined hashcode algorithm (unless it is so prevalent that it is a de
> facto standard).
>
> > So it still requires an algorithm like the one we discussed before,
> > though that was not easy at all. If we agree on using & (bucketsize - 1)
> > (because it is much faster than % prime),
>
> I think we have agreed on that...
>
> > we should find a way to gather all
> > 32 bits of information into log2(bucketsize) bits. The operators '+', '-',
> > '^', '>>' are used for this purpose (they are fast themselves). However,
> > bucketsize is usually too small (at the beginning it is 16, i.e. 4 bits)
> > to preserve all 32 bits of difference; we have to lose some information
> > after all.
>
> Exactly, no amount of bit-twiddling will help you contain 32 bits of
> information in 4 bits! The knack is knowing which four bits to select
> from the hashCode. Unless you know the distribution of the hashCode
> values is in fact a subset of all the possible 32-bit values allowed by
> the spec, you can only assume that they are random -- at which point you
> can select any four bits.
>
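Tim's point can be seen directly from the masking step: & (numBuckets - 1) keeps only the lowest log2(numBuckets) bits, so whatever the upper bits carry is simply discarded. A small illustration with made-up hashcode values:

```java
// Why masking with (numBuckets - 1) "selects four bits": with 16 buckets,
// only the lowest 4 bits of the hashcode survive the mask, so two hashcodes
// that differ only above bit 3 collide in the same bucket.
public class MaskDemo {
    static int bucket(int hashCode, int numBuckets) {
        // numBuckets must be a power of two for this mask trick to work
        return hashCode & (numBuckets - 1);
    }

    public static void main(String[] args) {
        System.out.println(bucket(0x12345678, 16)); // low 4 bits: 8
        System.out.println(bucket(0x0BCD0008, 16)); // same low 4 bits: 8
    }
}
```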
> > Let's first make two hypotheses: 1) hashcodes (and user-defined
> > hashcodes) are evenly distributed themselves; 2) however, the lower
> > bits matter more than the higher bits (the higher bits cannot all
> > contribute to the index, especially with small bucket counts).
>
> You mean more important to the hashmap implementation because they are
> used by & (numBuckets - 1) to compute the bucket index, right?
>
> > Under these two hypotheses, we can focus on the lower bits. As for
> > Yuri's problem that changes in the upper bits may not affect the bucket
> > index: it does not matter (as may happen in most cases). However, the
> > index-computing algorithm cannot be too complex; 4 or 5 operators may
> > be enough to spread hashcodes more evenly across the buckets while
> > avoiding extra cost in the lookup.
> > After reading some papers and hash algorithms (most of them
> > are too complex and time-consuming for us), I currently have this
> > expression:
> > index = hashcode ? hashcode >> a ? hashcode >> b (? hashcode >> c)
> > where '?' stands for one of the operators '^', '+', '-', '&', and
> > a, b, c stand for numbers less than 32 with a < b < c (I don't know if
> > the last operator and c can be omitted).
> > Our task is to find the best a, b, c and the operators between them
> > to improve hashcode performance. Currently the best expressions I've
> > found are:
> > (hash ^ hash >> 9) + hash >> 23
> > or
> > (hash ^ hash >> 9) + (hash >> 15 - hash >> 23)
> > Both of them work quite well in my microbenchmark and improve
> > Specjbb a little (at least no worse than using "hash" alone :) )
> > But I still believe it can be improved a lot.
>
> I think you are optimizing for a particular implementation of
> identityHashCode, and the class library code should be agnostic.
>
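For what it's worth, the candidate expression quoted above amounts to a spread step applied before the mask. The grouping below is my reading of the email (Java's shift operators bind more loosely than '+', so explicit parentheses matter), and '>>>' is used so a negative hashcode does not smear sign bits -- both are assumptions, not necessarily what was intended:

```java
// Sketch of the proposed spread function; grouping and '>>>' are my
// assumptions. Address-like hashcodes (low 2 bits always 0) that would
// otherwise land in 4 buckets now reach many more.
public class SpreadDemo {
    static int spread(int hash) {
        return (hash ^ (hash >>> 9)) + (hash >>> 23);
    }

    static int bucket(int hash, int numBuckets) {
        return spread(hash) & (numBuckets - 1);
    }

    public static void main(String[] args) {
        java.util.Set<Integer> used = new java.util.HashSet<Integer>();
        for (int i = 0; i < 1000; i++) {
            used.add(bucket(0x1000 + 4 * i, 16)); // 4-byte-aligned "addresses"
        }
        System.out.println(used.size() + " of 16 buckets used");
    }
}
```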
> > To sum up: 1) I believe Sergey's idea does help a lot in the
> > hash-related classes, which requires the VM guys to do some modification
> > to identityHashCode (a simple guess: object address >> 2?);
>
> I agree that the VM should do its best job at producing a good
> distribution of values for identityHashCode.
>
> > 2) for user-defined hashcodes and HashMap itself, based on
> > & (bucketsize - 1), we can find good parameters for the expression I
> > list above, or find some similar expression to use, so that hashcodes
> > are distributed more evenly across the buckets.
>
> Sorry for being so dumb, but how can you devise an expression that will
> produce a better bucket index from a hashcode if you don't know anything
> about the characteristics of the hashcode?
>
Ah, I really don't know how to produce a better bucket index from a
hashcode if I don't know anything about the characteristics of the
hashcode. So I just take the hypothesis that the hashcode is random
itself, and only in that scenario can we go further and do some tuning.
All the suggestions here are for the HashMap implementation, as we
discussed before (maybe also good for other hash-related classes ;) )
> Regards,
> Tim
>
--
Best Regards!
Jimmy, Jing Lv
China Software Development Lab, IBM