From: "Doug Cutting (Commented) (JIRA)"
To: dev@avro.apache.org
Reply-To: dev@avro.apache.org
Date: Mon, 31 Oct 2011 16:13:32 +0000 (UTC)
Message-ID: <655477766.41229.1320077612252.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <195985141.27168.1319751032275.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (AVRO-946) GenericData.resolveUnion() performance improvement

[
https://issues.apache.org/jira/browse/AVRO-946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140260#comment-13140260 ]

Doug Cutting commented on AVRO-946:
-----------------------------------

Identity equality may result in multiple entries for a given schema but the cache should still work correctly. It would perform poorly if every instance had a different schema, but that's not likely.

Also note that Schema now caches hash codes. So even using equals hashing would usually only result in a single call to equals, to verify the hash entry. Equals is fast for identical objects, so, if you used equals hashing, the slow case would be when the cached key is equal but not identical. I think identity hashing with weak keys is probably preferable.

> GenericData.resolveUnion() performance improvement
> --------------------------------------------------
>
>                 Key: AVRO-946
>                 URL: https://issues.apache.org/jira/browse/AVRO-946
>             Project: Avro
>          Issue Type: Improvement
>          Components: java
>    Affects Versions: 1.6.0
>            Reporter: Hernan Otero
>
> Due to the sequential nature of today's implementation of GenericData.resolveUnion() (used when serializing an object):
> {code}
>   public int resolveUnion(Schema union, Object datum) {
>     int i = 0;
>     for (Schema type : union.getTypes()) {
>       if (instanceOf(type, datum))
>         return i;
>       i++;
>     }
>     throw new UnresolvedUnionException(union, datum);
>   }
> {code}
> it showed up when we were doing some serialization performance analysis. A simple optimization can be implemented by keeping a map within the UnionSchema object (in fact, this could actually be a perfect hash map given the potential values in the map are known in advance). The optimization is obviously most notable when a Union within the schema contains many types (in our particular use case, more than 40 in some cases). In this scenario, we observed a 25% improvement by using an identity hash map.
> Even though using an identity map provides a significant boost, we have observed an even further improvement (and removed some of the restrictions of relying on object identity) by using a perfect hash map on the schema names (an extra 15% on top of that in some cases). This implementation, unfortunately, is not something we could contribute at this point, but we thought it'd be a good idea to allow users to provide alternative implementations of the indexing behavior, such as adding the following static method to Schema:
> {code}
>   public static void setUnionTypeIndexCacheFactory(UnionIndexCacheFactory factory)
>   {
>     unionIndexCacheFactory = factory;
>   }
> {code}
> This is what the interface and identity hash map-based implementation would look like:
> {code}
>   /**
>    * A factory interface for creating UnionIndexCache instances.
>    */
>   public static interface UnionIndexCacheFactory
>   {
>     UnionIndexCache createUnionIndexCache(List<Schema> types);
>     /**
>      * Used for caching schema indices within a union.
>      */
>     public static interface UnionIndexCache
>     {
>       void setTypeIndex(Schema schema, int index);
>       int getTypeIndex(Schema schema);
>     }
>   }
>   private static class IdentityMapUnionIndexCacheFactory implements UnionIndexCacheFactory
>   {
>     @Override
>     public UnionIndexCache createUnionIndexCache(List<Schema> types)
>     {
>       return new UnionIndexCache()
>       {
>         private final IdentityHashMap<Schema, Integer> schemaToIndex = new IdentityHashMap<Schema, Integer>();
>         @Override
>         public void setTypeIndex(Schema schema, int index)
>         {
>           schemaToIndex.put(schema, index);
>         }
>         @Override
>         public int getTypeIndex(Schema schema)
>         {
>           Integer index = schemaToIndex.get(schema);
>           return index == null ? -1 : index;
>         }
>       };
>     }
>   }
> {code}
> I will attach a patch later today or early tomorrow.
> Thanks in advance,
> Hernan Otero

--
This message is automatically generated by JIRA.
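
The comment above prefers identity hashing with weak keys. The JDK offers WeakHashMap (weak keys, equals-based) and IdentityHashMap (identity-based, strong keys) but no public map that combines the two, so what follows is only a minimal sketch of one way such a cache might be built. The class name WeakIdentityCache and its put/get methods are hypothetical; they are not part of Avro or of any patch attached to AVRO-946, and the synchronization is an added assumption.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a cache keyed by object identity that holds its keys weakly. */
final class WeakIdentityCache<K, V> {

  /** Weak reference that hashes and compares by the identity of its referent. */
  private static final class IdentityKey<K> extends WeakReference<K> {
    private final int hash;

    IdentityKey(K referent, ReferenceQueue<K> queue) {
      super(referent, queue);
      // Remember the hash so the entry can still be located after the referent is cleared.
      this.hash = System.identityHashCode(referent);
    }

    @Override
    public int hashCode() {
      return hash;
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) return true;
      if (!(other instanceof IdentityKey)) return false;
      Object mine = get();
      // Two keys match only if they refer to the very same live object.
      return mine != null && mine == ((IdentityKey<?>) other).get();
    }
  }

  private final Map<IdentityKey<K>, V> map = new HashMap<>();
  private final ReferenceQueue<K> queue = new ReferenceQueue<>();

  public synchronized void put(K key, V value) {
    purge();
    map.put(new IdentityKey<>(key, queue), value);
  }

  public synchronized V get(K key) {
    purge();
    // A temporary key with no queue is enough for the lookup.
    return map.get(new IdentityKey<>(key, null));
  }

  /** Drops entries whose keys have been garbage collected. */
  private void purge() {
    Object stale;
    while ((stale = queue.poll()) != null) {
      map.remove(stale);
    }
  }

  public static void main(String[] args) {
    WeakIdentityCache<String, Integer> cache = new WeakIdentityCache<>();
    String a = new String("record");
    String b = new String("record"); // equal to a, but a different object
    cache.put(a, 0);
    System.out.println(cache.get(a)); // hit: same object
    System.out.println(cache.get(b)); // miss: equal but not identical
  }
}
```

With Schema objects as keys, an equal-but-not-identical schema would simply miss and get its own entry, which matches the "multiple entries for a given schema" behaviour described in the comment, and entries disappear once their schema is no longer referenced elsewhere.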