avro-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tie Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-1428) Schema.computeHash() to add if check to avoid unnecessary hashcode computation
Date Wed, 15 Jan 2014 22:30:25 GMT

    [ https://issues.apache.org/jira/browse/AVRO-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872703#comment-13872703
] 

Tie Liu commented on AVRO-1428:
-------------------------------

it's half right that I suggest cache the field hashCode. In the field.hashCode:
        public int hashCode()
        {
            return name.hashCode() + schema.computeHash();
        }
The way I suggest is to avoid the "schema.computeHash()" in the field.hashCode since the hashcode
for the field schema may already been calculated. What I really want is field.hashCode:
        public int hashCode()
        {
            return name.hashCode() + schema.hashCode();
        }
but  we cannot do that since the schema.hashCode is defined as final, and subclasses of Schema
(recordShema/UnionSchema/etc.) is overriding the computeHash instead.

so to solve the problem, we need to either make schema.hashCode not final, and fields.hashCode
call schema.hashCode, or move the hashcode cache check inside computeHash. How do you think?


> Schema.computeHash() to add if check to avoid unnecessary hashcode computation
> ------------------------------------------------------------------------------
>
>                 Key: AVRO-1428
>                 URL: https://issues.apache.org/jira/browse/AVRO-1428
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Tie Liu
>         Attachments: AVRO-1428.patch, AVRO-1428.patch-V2, TestAvro.java
>
>
> In current Schma.java we have following implementation:
>   public final int hashCode() {
>     if (hashCode == NO_HASHCODE)
>       hashCode = computeHash();
>     return hashCode;
>   }
>   int computeHash() { return getType().hashCode() + props.hashCode(); }
>   While hashCode is doing the checking of "if (hashCode == NO_HASHCODE)", the computeHash
method is not. But the computeHash method is being called from Schema$Field.hashCode and the
subclasses hashCode implementation like following:
> public int hashCode() { return name.hashCode() + schema.computeHash(); }  //this is from
Schema$Field class
>   This is causing the the calculation of hashCode getting called unnecessarily extensively.
The proposed changed is to add the "if" check inside the computeHash method instead:
>     int computeHash()
>     {
>         if (hashCode == NO_HASHCODE)
>         {
>             hashCode = getType().hashCode() + props.hashCode();
>         }
>         return hashCode;
> }
> We did a simple test to compare the performance difference, below is a summary of the
heap snapshot of comparing the difference:
> As a test I wrote a small program that creates a HashMap<Schema.Field, Integer>()
and enters a loop simply identifying whether various Schema.Field instances are keys in the
map. Obviously this is a pathological test case, but when running with the current implementation
of Schema.Field it has (in about 30 seconds) used up nearly 8 GBytes of heap in instantiating
intermediate objects associated with calling Schema.computeHash():
> Heap
> PSYoungGen      total 17432576K, used 8666481K [0x0000000340000000, 0x0000000800000000,
0x0000000800000000)
>   eden space 14942208K, 58% used [0x0000000340000000,0x0000000550f5c650,0x00000006d0000000)
>   from space 2490368K, 0% used [0x0000000768000000,0x0000000768000000,0x0000000800000000)
>   to   space 2490368K, 0% used [0x00000006d0000000,0x00000006d0000000,0x0000000768000000)
> ParOldGen       total 1048576K, used 0K [0x0000000300000000, 0x0000000340000000, 0x0000000340000000)
>   object space 1048576K, 0% used [0x0000000300000000,0x0000000300000000,0x0000000340000000)
> PSPermGen       total 21504K, used 5782K [0x00000002fae00000, 0x00000002fc300000, 0x0000000300000000)
>   object space 21504K, 26% used [0x00000002fae00000,0x00000002fb3a5818,0x00000002fc300000)
> When running with the modified implementation (and no other change) all the object allocation
vanishes:
> Heap
> PSYoungGen      total 17432576K, used 896532K [0x0000000340000000, 0x0000000800000000,
0x0000000800000000)
>   eden space 14942208K, 6% used [0x0000000340000000,0x0000000376b852d0,0x00000006d0000000)
>   from space 2490368K, 0% used [0x0000000768000000,0x0000000768000000,0x0000000800000000)
>   to   space 2490368K, 0% used [0x00000006d0000000,0x00000006d0000000,0x0000000768000000)
> ParOldGen       total 1048576K, used 0K [0x0000000300000000, 0x0000000340000000, 0x0000000340000000)
>   object space 1048576K, 0% used [0x0000000300000000,0x0000000300000000,0x0000000340000000)
> PSPermGen       total 21504K, used 5768K [0x00000002fae00000, 0x00000002fc300000, 0x0000000300000000)
>   object space 21504K, 26% used [0x00000002fae00000,0x00000002fb3a2240,0x00000002fc300000)
> As a side-effect the test runs x3 faster with the modified hashCode() implementation.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message