avro-dev mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-853) Cache hash codes in Schema and Field
Date Thu, 07 Jul 2011 16:55:17 GMT

    [ https://issues.apache.org/jira/browse/AVRO-853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061438#comment-13061438 ]

Scott Carey commented on AVRO-853:
----------------------------------

@Douglas

Good point: we can simplify the hash code functions for complicated members like Props and
Aliases. We can either ignore props entirely, or use only a portion that is simple to compute:
the size.
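
A minimal sketch of that idea, assuming a hypothetical CachedHashSchema class (a stand-in for illustration, not Avro's actual Schema): equals still compares props in full, while hashCode mixes in only props.size() and is computed once and then cached. Equal objects always have props of equal size, so the hashCode/equals contract still holds.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for an immutable schema; not Avro's Schema class.
final class CachedHashSchema {
    private final String name;
    private final String type;
    private final Map<String, String> props;
    // Cached hash code; 0 means "not computed yet" (same idiom as java.lang.String).
    private int hash;

    CachedHashSchema(String name, String type, Map<String, String> props) {
        this.name = name;
        this.type = type;
        // Defensive copy so later changes to the caller's map cannot affect this instance.
        this.props = new LinkedHashMap<>(props);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedHashSchema)) return false;
        CachedHashSchema that = (CachedHashSchema) o;
        return name.equals(that.name)
            && type.equals(that.type)
            && props.equals(that.props);
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            // Only the size of props is used: cheap, and consistent with equals.
            h = 31 * Objects.hash(name, type) + props.size();
            hash = h; // if the result happens to be 0, it is simply recomputed next time
        }
        return h;
    }
}
```

The trade-off is more collisions between schemas that differ only in prop values, in exchange for a hash that never walks the props map.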

How should equals with Aliases work?

Are the three schemas below equivalent?

{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo"]}
{code}

Keep in mind that equals must be transitive (A == B and B == C implies A == C) and symmetric
(C.equals(A) must be true if A.equals(C) is true).
In the above, aliases allow A == B == C.

But this represents a problem for other cases:
{code}
A: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}]}
B: {"type":"record", "name":"foo", "fields":[{"name":"bar", "type":"string"}], "aliases":["foo2"]}
C2: {"type":"record", "name":"foo2", "fields":[{"name":"bar", "type":"string"}]}
{code}

Aliases allow A == B and B == C2, but A != C2, which violates transitivity.  Therefore, we can
use aliases in equality in only two ways:
# not at all
# exact match only

This means that we either:
* Ignore aliases: A == B, B != C, A != C
* Exact match only: A != B, B != C, A != C
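
To make the transitivity argument concrete, here is a rough sketch that brute-forces every triple over A, B and C2. The Rec class and the three predicates are hypothetical simplifications (only the name and aliases are modeled, since the fields are identical in all three schemas); none of this is Avro's API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiPredicate;

final class AliasEqualityDemo {
    // Hypothetical record descriptor: just a name and a set of aliases.
    static final class Rec {
        final String name;
        final Set<String> aliases;

        Rec(String name, String... aliases) {
            this.name = name;
            this.aliases = new HashSet<>(Arrays.asList(aliases));
        }
    }

    // Alias-aware: names match, or either side lists the other's name as an alias.
    static final BiPredicate<Rec, Rec> ALIAS_AWARE = (x, y) ->
        x.name.equals(y.name) || x.aliases.contains(y.name) || y.aliases.contains(x.name);

    // Option 1: aliases are not consulted at all.
    static final BiPredicate<Rec, Rec> IGNORE_ALIASES = (x, y) -> x.name.equals(y.name);

    // Option 2: name and alias set must both match exactly.
    static final BiPredicate<Rec, Rec> EXACT_MATCH = (x, y) ->
        x.name.equals(y.name) && x.aliases.equals(y.aliases);

    // Returns false as soon as some x ~ y and y ~ z hold without x ~ z.
    static boolean isTransitive(List<Rec> recs, BiPredicate<Rec, Rec> eq) {
        for (Rec x : recs) {
            for (Rec y : recs) {
                for (Rec z : recs) {
                    if (eq.test(x, y) && eq.test(y, z) && !eq.test(x, z)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Rec> recs = Arrays.asList(
            new Rec("foo"),          // A
            new Rec("foo", "foo2"),  // B
            new Rec("foo2"));        // C2
        System.out.println(isTransitive(recs, ALIAS_AWARE));    // false
        System.out.println(isTransitive(recs, IGNORE_ALIASES)); // true
        System.out.println(isTransitive(recs, EXACT_MATCH));    // true
    }
}
```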

I vote for ignoring Aliases in equality checks, as we currently do, and having a separate
method, distinct from equals, for checking whether one schema can be transformed into another
using aliases ("alias promotion").

This is an asymmetric process that does not have the transitive property:  A promotesTo B,
B promotesTo C2, but A !promotesTo C2.
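
A rough sketch of what such a check might look like. The promotesTo rule below (names match, or the source lists the target's name as an alias) and the Rec class are assumptions made for illustration; the direction of the rule is chosen only so that the three statements above come out as written, and it is not taken from Avro's resolution code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class AliasPromotionDemo {
    // Hypothetical record descriptor: just a name and a set of aliases.
    static final class Rec {
        final String name;
        final Set<String> aliases;

        Rec(String name, String... aliases) {
            this.name = name;
            this.aliases = new HashSet<>(Arrays.asList(aliases));
        }
    }

    // Assumed rule: 'from' promotes to 'to' if the names match,
    // or if 'from' declares the name of 'to' as one of its aliases.
    static boolean promotesTo(Rec from, Rec to) {
        return from.name.equals(to.name) || from.aliases.contains(to.name);
    }

    public static void main(String[] args) {
        Rec a = new Rec("foo");
        Rec b = new Rec("foo", "foo2");
        Rec c2 = new Rec("foo2");
        System.out.println(promotesTo(a, b));   // true
        System.out.println(promotesTo(b, c2));  // true
        System.out.println(promotesTo(a, c2));  // false: not transitive
        System.out.println(promotesTo(c2, b));  // false: not symmetric
    }
}
```

Keeping this out of equals() means hash-based collections never see a relation that breaks the equals contract.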


I also suspect that we should remove props from equals().  I think those behave similarly to
aliases.

Are the three schemas below different?  Should they differ across languages or do they represent
different data?
{code}
A: {"type":"array", "items":"int"}
B: {"type":"array", "items":"int", "java.typehint":"java.util.List"}
C: {"type":"array", "items":"int", "java.typehint":"intarray"}
{code}

One could argue that these are equal (the serialized form is the same, and the extra properties
are only specific to one language implementation).
The props here are just specialized documentation.

I think we have two consistent choices:
* Schemas are equal only if all aliases, props, and doc fields match exactly -- in other words
if toString() prints the same result.
* Schemas are equal based on name, type, and structure alone.
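
As a sketch of how the two choices play out in a hash-based collection, the example below counts distinct keys for the three array schemas above. ArraySchema, StrictKey and StructuralKey are hypothetical classes invented for this illustration, and the strict key compares the item type and props directly rather than going through toString().

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;

final class SchemaKeyDemo {
    // Hypothetical array schema: an item type plus free-form props.
    static final class ArraySchema {
        final String items;
        final Map<String, String> props;

        ArraySchema(String items, Map<String, String> props) {
            this.items = items;
            this.props = new TreeMap<>(props);
        }
    }

    // Choice 1: everything must match exactly, props included.
    static final class StrictKey {
        final ArraySchema s;
        StrictKey(ArraySchema s) { this.s = s; }

        @Override public boolean equals(Object o) {
            return o instanceof StrictKey
                && s.items.equals(((StrictKey) o).s.items)
                && s.props.equals(((StrictKey) o).s.props);
        }
        @Override public int hashCode() { return Objects.hash(s.items, s.props); }
    }

    // Choice 2: type and structure only; props are ignored.
    static final class StructuralKey {
        final ArraySchema s;
        StructuralKey(ArraySchema s) { this.s = s; }

        @Override public boolean equals(Object o) {
            return o instanceof StructuralKey
                && s.items.equals(((StructuralKey) o).s.items);
        }
        @Override public int hashCode() { return s.items.hashCode(); }
    }

    static int distinctStrict(ArraySchema... schemas) {
        Set<StrictKey> keys = new HashSet<>();
        for (ArraySchema s : schemas) keys.add(new StrictKey(s));
        return keys.size();
    }

    static int distinctStructural(ArraySchema... schemas) {
        Set<StructuralKey> keys = new HashSet<>();
        for (ArraySchema s : schemas) keys.add(new StructuralKey(s));
        return keys.size();
    }

    public static void main(String[] args) {
        ArraySchema a = new ArraySchema("int", Collections.<String, String>emptyMap());
        ArraySchema b = new ArraySchema("int",
            Collections.singletonMap("java.typehint", "java.util.List"));
        ArraySchema c = new ArraySchema("int",
            Collections.singletonMap("java.typehint", "intarray"));
        System.out.println(distinctStrict(a, b, c));     // 3
        System.out.println(distinctStructural(a, b, c)); // 1
    }
}
```

Either key type gives a proper equivalence relation; mixing the two (for example, honoring some props but not others) is where the inconsistencies start.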

> Cache hash codes in Schema and Field
> ------------------------------------
>
>                 Key: AVRO-853
>                 URL: https://issues.apache.org/jira/browse/AVRO-853
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.5.1
>            Reporter: Douglas Kaminsky
>         Attachments: AVRO-853-approach2.patch, AVRO-853.patch
>
>
> We are experiencing a serious performance degradation when trying to store/retrieve fields
> and schemas in hash-based data structures (e.g. HashMap). Since all fields and schemas are
> immutable (with the exception of RecordSchema allowing deferred setting of Fields) it makes
> sense to cache the hash code on the object instead of recalculating every time the hashCode
> method gets called.
> (Are there other mutable Schema sub-types that I'm not thinking about?)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
