Mailing-List: contact dev-help@harmony.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@harmony.apache.org
Received-SPF: pass (herse.apache.org: domain of t.p.ellison@gmail.com
 designates 66.249.90.176 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=beta;
        h=received:message-id:date:from:user-agent:mime-version:to:subject:references:in-reply-to:x-enigmail-version:content-type:content-transfer-encoding;
        b=WN+YYt0WCCK8/OEUiTRrLVAXnK329mQ6hMJZprWG0vyFCUEBVL8JSqYpUfJoUnjT4H2LEbsRTJvxSuWd6UHZw6zpQyd+NAckarfJKrC9CKDetL2oGwHNUSRZ/aSlonXz9pm62kxqFFqZEmIJM12Dd//A6uFbBzn4BL0940hh2ZE=
Message-ID: <46851389.10603@gmail.com>
Date: Fri, 29 Jun 2007 15:13:29 +0100
From: Tim Ellison <t.p.ellison@gmail.com>
User-Agent: Thunderbird 2.0.0.4 (Windows/20070604)
MIME-Version: 1.0
To: dev@harmony.apache.org
Subject: Re: Performance improvement of java.util.HashMap
References: <5c8e69f0706270310k38a93b99oa84e92d7d54220c4@mail.gmail.com>
 <253b20230706290634x12740163h6f18dedadf7dbc7c@mail.gmail.com>
In-Reply-To: <253b20230706290634x12740163h6f18dedadf7dbc7c@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Sergey Kuksenko wrote:
> I wish to include myself into discussion and express some my thought
> in this area.

Great -- the more input the better!

> I am absolutely agree with the idea of using bitwise operations & (or
>  power of two remainder) instead of % (arbirary remainder) because of
> power of two remainder is really faster and was proved by my tests
> and etc...
> 
> First of all I wish to note that if HashMap works with user-defined 
> HashCode it doesn't matter which reminder is used: power fo two (&)
> or prime(or arbitrary) number reminder (%). In case if user defined
> HashCode we can't predict hash and the hash can be suitable as for &
> and for %.

Right, I think we pretty much have to ignore the user defined hashCode()
since we have no insight into its characteristics, other than the
obvious ones like it being a 32-bit value.

> But prime number reminder has one disadavntages (except speed of 
> operation). When HashMap uses prime number (or some irreglar) we get
> negative effect of unused part of memory page. Size of memory pages
> is power of two and it is better to fill it completely. Espesially it
> is visible on large hash maps. Unfortunately I can privide any
> numbers at the moment. Thus in my opinion power of two reminder gives
> boost in time of operation plus benefits from more natural memory
> usage (layouts etc.)

Good point, for the memory page occupancy of the contiguous
'elementData' array of map entries.

In a somewhat related thought, it may be worth experimenting with
pre-allocating a number of Entries, to get better memory locality for
the whole hash map, but I digress...

> As to default hashCode which is IdentityHashCode.

I don't know what object types are the most common for hashmap keys.
For example, if it turns out that keys are usually Strings then the
hashCode() values' characteristics would be different.  But I agree that
we should do well for identityHashCode values since that probably
represents the biggest set of key types we want to accommodate.

> There are two ways here:
> 1. Add new hashing function (additional hashing) inside HashMap. As was
> suggested here.
> 2. Change IdentityHashCode implementation. There are no rules that 
> IdentityHashCode should return exactly address of object. And last
> zero bits from IdentityHashCode (which caused by object alignment)
> can be removed inside IdentityHashCode. Other words - to add
> additional hashing inside IdentityHashCode.

There is no rule that identityHashCode should be the address, you are
right - but I think that is a common implementation.  We would like the
class library code to work well on multiple-VM implementations.

> The first solution has one advantage - it can be done only in 
> classlib. The second variant should be implemented in all VMs which
> are based on Harmony's classlib. But this way has other advantages -
> Well-formed IdentityHashCode will helps not only for HashMap, but for
> Hashtable, IdentityHashMap

Well we could use whatever technique we settle on in all our hashed
collections<g>

>  and moreover for all hashing implemented by users.

how so? unless they query the object identityhashCode, which I think is
not usually the case when overriding hashCode().

> In the second variant we don't add rehashing function into HashMap.
> It is common practice when advanced Java developers tune their
> HashCode functions. If we add rehashing function into HashMap we may
> break their tuning, without knowledge of internal rehashing functions
> it is impossible to tune HashCode function.

I agree that only applying a very simple transformation would be
appropriate so as not to 'break' users' hash functions.  Plus it has the
advantage of not taking a long time in our code<g>.  That's why I had
suggested a simple rotate right operation (to help with the common
implementation of identityHashCode situation).

> From my opinion we should think about the last argument against adding
> rehashing into HashMap and move rehashing into VM.

So you would recommend using the key.hashCode() unmodified?

Regards,
Tim