db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Knut Anders Hatlen (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-3981) Improve distribution of hash codes in SQLBinary and SQLChar
Date Tue, 06 Jan 2009 11:25:44 GMT

    [ https://issues.apache.org/jira/browse/DERBY-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661106#action_12661106

Knut Anders Hatlen commented on DERBY-3981:

Thanks for looking at the patch, Kristian. Please see my answers to
your questions below.

> 1) The hashcode isn't saved. Is this because of the reuse of the
> SQLChar object?

It should be possible to save the hash code in a field and reset it
when the SQLChar is reused. (Same goes for SQLBinary, I believe.) I
don't think caching it will have any effect on the performance of the
SELECT DISTINCT test, since hashCode() should only be called once on
each object when they are inserted into the hash table. There may be
other cases where hashCode() is called many times on the same object,
though, so it might have a positive effect for some queries. I didn't
change this since the old code also had to loop through the entire
underlying array, so the need for caching shouldn't be much changed by
the patch. Another thing to consider is that adding a field to a data
type class increases the object size as reported by
org.apache.derby.iapi.services.cache.ClassSize and makes it (slightly)
less likely that the optimizer chooses a hash join strategy.

> 2) Is the hashCode method invoked as part of normal Derby usage for
>    Clobs?  I'm asking because SQLClob inherits the hashCode method
>    of SQLChar, which causes the value to be materialized.

No, I don't think so. When I looked at DERBY-3975 (see comment made on
17/Dec/08) I concluded that clob (and long varchar) columns couldn't
be compared without casting them to another data type (like char or
varchar) first, so hashCode(), equals() and compareTo() are not called
on SQLClob.

> 3) Not a question, but we are overriding both equals and
>  hashCode. For SQLChar I believe the relationship between the two
>  methods is correct (i.e. two equal objects must have the same
>  hashcode, pad characters are "ignored" in both methods).

Right. And SQLBinary.equals() uses SQLBinary.compare() which also
ignores trailing pad bytes (0x20). Note however that the original
SQLBinary.hashCode() ignored all occurrences of pad bytes, whereas the
new implementation only ignores trailing pad bytes.

> Improve distribution of hash codes in SQLBinary and SQLChar
> -----------------------------------------------------------
>                 Key: DERBY-3981
>                 URL: https://issues.apache.org/jira/browse/DERBY-3981
>             Project: Derby
>          Issue Type: Improvement
>          Components: Newcomer, Performance, SQL
>    Affects Versions:
>            Reporter: Knut Anders Hatlen
>            Assignee: Knut Anders Hatlen
>            Priority: Minor
>         Attachments: d3981.diff, distinct-test.diff
> SQLBinary.hashCode() and SQLChar.hashCode() use a very simple algorithm that just takes
the sum of the values in the array. This gives a poor distribution of hash values because
similar values will have a higher probability of mapping to the same hash code, and the higher
bits won't be used unless the array is very long. We should change these methods so that they
use an algorithm similar to the one used in java.lang.String.hashCode(), described here: <URL:http://java.sun.com/javase/6/docs/api/java/lang/String.html#hashCode()>.
This may have a positive effect on the performance of hash scans as it will reduce the likelihood
of collisions in the hash table.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message