avro-dev mailing list archives

From "James Baldassari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field
Date Thu, 07 Jul 2011 00:07:16 GMT

    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060937#comment-13060937 ]

James Baldassari commented on AVRO-853:

I've recently run into this hash code performance issue as well.  Scott, there may be a way
to work around the issue of propagating changes up the reference graph.  We would have to
calculate hash codes by computing the hash code for any "local" state, which could be cached,
and then adding to that local hash code the hash codes of all child objects, which could also
be cached independently.  For example, a RecordSchema has 'aliases', 'doc', 'name', 'props',
and 'isError', for its local state, and the hash code for those values could be cached.  Then
to calculate the rest of the RecordSchema hash code we invoke hashCode() on all Field instances
in the RecordSchema.  Each Field instance knows whether it has been modified or not, and so
it can either return a cached hash code or recalculate the hash code for its local state,
and so on.  So we would still need to traverse the whole schema graph every time the hash
code is requested, but we might achieve some performance gains by caching values at each node
in the graph.  Does that make sense?
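
A minimal sketch of the per-node caching described above. FieldNode and RecordNode
are hypothetical stand-ins, not Avro's actual Schema.Field or RecordSchema classes,
and the 'recomputations' counter exists only to make the caching visible. equals()
is omitted to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; these are not Avro classes.
class NodeHashSketch {

  static class FieldNode {
    private final String name;
    private String doc;
    private int localHash;
    private boolean localHashValid = false;
    int recomputations = 0; // instrumentation for the demo

    FieldNode(String name, String doc) {
      this.name = name;
      this.doc = doc;
    }

    void setDoc(String doc) {
      this.doc = doc;
      localHashValid = false; // invalidates this node's cache only
    }

    @Override
    public int hashCode() {
      if (!localHashValid) {
        localHash = Objects.hash(name, doc);
        localHashValid = true;
        recomputations++;
      }
      return localHash;
    }
  }

  static class RecordNode {
    private final String name;
    private final boolean isError;
    private final List<FieldNode> fields = new ArrayList<>();
    private int localHash;
    private boolean localHashValid = false;

    RecordNode(String name, boolean isError) {
      this.name = name;
      this.isError = isError;
    }

    void addField(FieldNode f) {
      fields.add(f);
    }

    @Override
    public int hashCode() {
      // Local state is hashed once and cached.
      if (!localHashValid) {
        localHash = Objects.hash(name, isError);
        localHashValid = true;
      }
      // Children are still visited on every call, but each one
      // answers from its own cache unless it has been modified.
      int h = localHash;
      for (FieldNode f : fields) {
        h = 31 * h + f.hashCode();
      }
      return h;
    }
  }

  public static void main(String[] args) {
    FieldNode a = new FieldNode("a", "first");
    RecordNode r = new RecordNode("rec", false);
    r.addField(a);
    r.hashCode();
    r.hashCode();
    a.setDoc("changed");
    r.hashCode();
    System.out.println("field recomputations: " + a.recomputations);
  }
}
```

Nothing has to be propagated up the graph when a field changes: the parent never
caches its children's contribution, so a stale child hash cannot get stuck in it.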

Also, I took a look at the patch, and I have a couple of comments in addition to Doug's and

Is it safe to assume that a calculated hash code will never be 0?  Maybe null would be a safer
choice for the default/invalidated value.
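
To illustrate the concern, here is a small sketch comparing the two sentinels. Both
classes are hypothetical and hash only a name; the 'computations' counter is there
just for the demonstration. An empty string hashes to 0 in Java, so it is a
convenient value that collides with a 0 sentinel.

```java
// Hypothetical illustration only; these are not Avro classes.
class SentinelSketch {

  static class ZeroSentinel {
    private final String name;
    private int cached = 0; // 0 doubles as "not computed yet"
    int computations = 0;   // instrumentation for the demo

    ZeroSentinel(String name) {
      this.name = name;
    }

    @Override
    public int hashCode() {
      if (cached == 0) {
        cached = name.hashCode();
        computations++;
      }
      return cached;
    }
  }

  static class NullSentinel {
    private final String name;
    private Integer cached = null; // null means "not computed yet"
    int computations = 0;          // instrumentation for the demo

    NullSentinel(String name) {
      this.name = name;
    }

    @Override
    public int hashCode() {
      if (cached == null) {
        cached = name.hashCode();
        computations++;
      }
      return cached;
    }
  }

  public static void main(String[] args) {
    ZeroSentinel z = new ZeroSentinel("");
    NullSentinel n = new NullSentinel("");
    for (int i = 0; i < 3; i++) {
      z.hashCode();
      n.hashCode();
    }
    System.out.println("zero sentinel computations: " + z.computations);
    System.out.println("null sentinel computations: " + n.computations);
  }
}
```

With a 0 sentinel the returned value is still correct when the real hash happens to
be 0; the cost is that such an object is rehashed on every call. A separate boolean
flag would avoid the boxing that Integer introduces.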

Thread-safety is actually an issue here.  Since addProp() and getProp() are synchronized we
have to assume that Schema is intended to be used by multiple threads.  The worst-case scenario
for the unsynchronized hash code cache is worse than just having 2 threads calculate the hash
code at once.  If one thread is modifying the schema while another is calling hashCode(),
it could result in a temporary inconsistency between hashCode() and equals().  This could
lead to some confusing problems with hash maps/sets.  If the caching is abstracted up into
the Schema base class as Doug suggests, it would be fairly simple to synchronize access to
the cache.  This could be done with a synchronized method/block or by using something like
a ReadWriteLock, which would probably have better performance characteristics for a read-frequently-write-infrequently
use case such as this.  In fact, there may already be a synchronization issue with Schema
because the properties map is not accessed in a synchronized way in equals() or hashCode()...
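
A rough sketch of the ReadWriteLock variant, assuming the cache lives next to the
properties map. PropsHolder is a hypothetical class, not Avro's Schema, and only
mirrors the addProp()/getProp() names mentioned above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration only; this is not Avro's Schema class.
class LockedHashSketch {

  static class PropsHolder {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> props = new HashMap<>();
    private Integer cachedHash = null; // null means "invalidated"

    void addProp(String key, String value) {
      lock.writeLock().lock();
      try {
        props.put(key, value);
        cachedHash = null; // mutation and invalidation happen atomically
      } finally {
        lock.writeLock().unlock();
      }
    }

    String getProp(String key) {
      lock.readLock().lock();
      try {
        return props.get(key);
      } finally {
        lock.readLock().unlock();
      }
    }

    @Override
    public int hashCode() {
      // Fast path: many readers may check the cache concurrently.
      lock.readLock().lock();
      try {
        if (cachedHash != null) {
          return cachedHash;
        }
      } finally {
        lock.readLock().unlock();
      }
      // Slow path: a read lock cannot be upgraded, so take the write
      // lock separately and re-check before computing.
      lock.writeLock().lock();
      try {
        if (cachedHash == null) {
          cachedHash = props.hashCode();
        }
        return cachedHash;
      } finally {
        lock.writeLock().unlock();
      }
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof PropsHolder)) return false;
      // Copy the other side first so two locks are never held at once.
      Map<String, String> theirs = ((PropsHolder) o).snapshot();
      lock.readLock().lock();
      try {
        return props.equals(theirs);
      } finally {
        lock.readLock().unlock();
      }
    }

    private Map<String, String> snapshot() {
      lock.readLock().lock();
      try {
        return new HashMap<>(props);
      } finally {
        lock.readLock().unlock();
      }
    }
  }

  public static void main(String[] args) {
    PropsHolder a = new PropsHolder();
    PropsHolder b = new PropsHolder();
    a.addProp("k", "v");
    b.addProp("k", "v");
    System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
  }
}
```

This only keeps hashCode() and equals() consistent with the object's current state.
An object that is mutated while it is already a key in a HashMap can still end up
in the wrong bucket, and no amount of locking inside the class fixes that.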

> Cache hash codes in Schema and Field
> ------------------------------------
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853.patch
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields
> and schemas in hash-based data structures (e.g. HashMap). Since all fields and schemas are
> immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes
> sense to cache the hash code on the object instead of recalculating it every time the hashCode
> method gets called.
> (Are there other mutable Schema sub-types that I'm not thinking about?)
