avro-dev mailing list archives

From "Hernan Otero (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement
Date Tue, 01 Nov 2011 20:45:33 GMT

    [ https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141567#comment-13141567 ]

Hernan Otero commented on AVRO-946:
-----------------------------------

On further thought, the proposed implementation does have a shortcoming.  To benefit from
the optimization, a GenericDatumWriter has to be shared so that its cache is reused, but the
current cache implementation is not thread safe.

One option would be to make the cache thread safe (e.g. by using a ConcurrentMap or a similar
structure).  A second option would be to move all of this back into UnionSchema and, for the
time being (pending the longer-term solution of making UnionSchema public and extensible),
rely on a HashMap<String, Integer> keyed by the getFullName() of the datum's schema, which
avoids the need to rely on identity.
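
To make the two options a bit more concrete, here is a rough sketch of each.  It is not the
attached patch: the class and method names are made up for illustration, and schemas are
stood in for by their full names (plain strings) so that the example compiles without Avro.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only: all names here are hypothetical, and schemas are represented
// by their full names (plain strings) instead of Avro Schema objects.
class UnionIndexSketch {

    /** Option 1: a lazily filled cache that several threads may share. */
    static final class ConcurrentNameIndexCache {
        private final ConcurrentMap<String, Integer> nameToIndex =
            new ConcurrentHashMap<String, Integer>();

        void setTypeIndex(String fullName, int index) {
            nameToIndex.put(fullName, index);
        }

        /** Returns the cached index, or -1 when the name has not been cached yet. */
        int getTypeIndex(String fullName) {
            Integer index = nameToIndex.get(fullName);
            return index == null ? -1 : index;
        }
    }

    /** Option 2: an index built once from the union's branch names, never mutated. */
    static final class PrecomputedNameIndex {
        private final Map<String, Integer> nameToIndex;

        PrecomputedNameIndex(List<String> branchFullNames) {
            Map<String, Integer> map = new HashMap<String, Integer>();
            for (int i = 0; i < branchFullNames.size(); i++) {
                String name = branchFullNames.get(i);
                // Keep the first position, matching the first-match rule of a linear scan.
                if (!map.containsKey(name)) {
                    map.put(name, i);
                }
            }
            this.nameToIndex = map;
        }

        /** Returns the branch index, or -1 when the name is not a branch of the union. */
        int getTypeIndex(String fullName) {
            Integer index = nameToIndex.get(fullName);
            return index == null ? -1 : index;
        }
    }

    public static void main(String[] args) {
        List<String> branches =
            Arrays.asList("example.Order", "example.Trade", "example.Quote");

        PrecomputedNameIndex precomputed = new PrecomputedNameIndex(branches);
        System.out.println(precomputed.getTypeIndex("example.Trade")); // prints 1

        ConcurrentNameIndexCache cache = new ConcurrentNameIndexCache();
        cache.setTypeIndex("example.Quote", 2);
        System.out.println(cache.getTypeIndex("example.Quote"));       // prints 2
        System.out.println(cache.getTypeIndex("example.Order"));       // prints -1
    }
}
```

In the second variant the map is filled in the constructor, stored in a final field and never
modified afterwards, so concurrent readers should not need any locking.  The first variant
keeps the lazy set/get shape of the cache in the patch and relies on the ConcurrentMap for
safety instead.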

Any thoughts/suggestions?

Thanks,

Hernan
                
> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>         Attachments: AVRO-946.patch
>
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion()
> (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis.  A simple
> optimization can be implemented by keeping a map within the UnionSchema object (in fact,
> this could actually be a perfect hash map, given that the potential values in the map are
> known in advance).  The optimization is most noticeable when a union within the schema
> contains many types (in our particular use case, more than 40 in some cases).  In this
> scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed a
> further improvement (and removed some of the restrictions of relying on object identity)
> by using a perfect hash map on the schema names (an extra 15% on top of that in some
> cases).  This implementation, unfortunately, is not something we could contribute at this
> point, but we thought it would be a good idea to allow users to provide alternative
> implementations of the indexing behavior, for example by adding the following static
> method to Schema:
> {code}
> public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
> {
>   unionIndexCacheFactory = factory;
> }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>       UnionIndexCache createUnionIndexCache(List<Schema> types);
>       /**
>        * Used for caching schema indices within a union.
>        */
>       public static interface UnionIndexCache
>       {
>           void setTypeIndex(Schema schema, int index);
>           int getTypeIndex(Schema schema);
>       }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>       @Override
>       public UnionIndexCache createUnionIndexCache(List<Schema> types)
>       {
>           return new UnionIndexCache()
>           {
>               private final IdentityHashMap<Schema, Integer> schemaToIndex =
>                   new IdentityHashMap<Schema, Integer>();
>               @Override
>               public void setTypeIndex(Schema schema, int index)
>               {
>                   schemaToIndex.put(schema, index);
>               }
>               @Override
>               public int getTypeIndex(Schema schema)
>               {
>                   Integer index = schemaToIndex.get(schema);
>                   return index == null ? -1 : index;
>               }
>           };
>       }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
