Return-Path: Delivered-To: apmail-avro-user-archive@www.apache.org Received: (qmail 40926 invoked from network); 10 Jun 2010 22:44:44 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 10 Jun 2010 22:44:44 -0000 Received: (qmail 40103 invoked by uid 500); 10 Jun 2010 22:44:43 -0000 Delivered-To: apmail-avro-user-archive@avro.apache.org Received: (qmail 40062 invoked by uid 500); 10 Jun 2010 22:44:43 -0000 Mailing-List: contact user-help@avro.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@avro.apache.org Delivered-To: mailing list user@avro.apache.org Received: (qmail 40054 invoked by uid 99); 10 Jun 2010 22:44:43 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 10 Jun 2010 22:44:43 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [64.78.17.16] (HELO EXHUB018-1.exch018.msoutlookonline.net) (64.78.17.16) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 10 Jun 2010 22:44:34 +0000 Received: from EXVMBX018-1.exch018.msoutlookonline.net ([64.78.17.47]) by EXHUB018-1.exch018.msoutlookonline.net ([64.78.17.16]) with mapi; Thu, 10 Jun 2010 15:44:13 -0700 From: Scott Carey To: "user@avro.apache.org" Date: Thu, 10 Jun 2010 15:44:18 -0700 Subject: Re: schema defaults not reflected in generated objects (1.3.2) Thread-Topic: schema defaults not reflected in generated objects (1.3.2) Thread-Index: AcsI7nIgrJU8B0P6S/CGreFzf6PCWg== Message-ID: <7AADB8CB-E436-4C51-8B8D-A6ACE5E7E749@richrelevance.com> References: <4C0D5C47.7010002@dehora.net> <4C0D6E8A.8040907@dehora.net> <4C0D72EF.60707@apache.org> <4C0D77DB.1060408@apache.org> <4C1133BB.7070402@apache.org> <4C115FCE.2000706@apache.org> In-Reply-To: <4C115FCE.2000706@apache.org> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Virus-Checked: Checked by ClamAV on apache.org On Jun 10, 2010, at 2:57 PM, Doug Cutting wrote: > Re boxing: Good idea to minimize it. >=20 >=20 > Also, we don't need to store a class in the enum. At code generation=20 > time we can simply write, e.g., Integer.class wherever you put INT.type. Good point, this is generated code, if we want to replace Integer.class w= ith AvroSuperInteger.class it would be hidden from the API. > Re storing type: this is a time/space tradeoff. If you store the union=20 > type in each record then you can write instances faster, since you can=20 > use type.ordinal(), and also dispatch switch statements for the value=20 > faster in user code. I guess we'll have to see what the time/space tradeoff looks like with some= testing later. For writing, its still a switch on the enum without the persisted type, som= ething like BarType bt =3D getBar$Type(); write(bt.ordinal()); switch(bt) { case INT: writeInt(getBar$Int()) ... case FLOAT: writeFloat(getBar$Float()) ... } The getBar$Type() requires the cascaded if/else though. And the get has a = conditional on getClass(). The latter should be very fast, the former less= so. This could be inverted to make the datum class be responsible for the writi= ng too: // note, thread safe writeBarUnion(Encoder e) { Object bar =3D this.bar; Class c =3D bar.getClass(); BarType bt =3D getBar$Type(c); // private helper writeInt(bt.ordinal()); switch(bt) { case INT: writeInt((int)bar); break; case FLOAT: writeFloat((float)bar); break; } } It might even make sense to just go all the way and add a=20 writeTo(Encoder e); Though that has some implications for lots of other stuff -- anything using= a DatumWriter. It does open op opportunities for optimizations and simpli= fication of union resolution without traversing the Schema. > A hash table lookup is probably not a lot faster=20 > than an if/then/elseif/... unless the union is really big. Personally=20 > I'd opt for performance over slightly smaller instances here. I suppose= =20 > we could offer both options, since it shouldn't alter the public API. >=20 Sometimes memory size is performance. It depends on whether you're streami= ng these in (size isn't a big deal) or holding on to them for a long time. I'd rather solidify the API though, and keep thinking through all the possi= bilities. As long as all these options are available with one API then min= or implementation detail options like this can be dealt with later. One other thing that would be very useful longer term would be to unify the= Specific and Reflect APIs. What I mean by that, is that the Specific API = should generate classes in such a way that the Reflect API would serialize = the objects to the same schema if asked to reflect on the object. This pro= bably means simply decorating the Specific objects with annotations. >=20 > Doug >=20 > On 06/10/2010 02:09 PM, Scott Carey wrote: >>=20 >> On Jun 10, 2010, at 11:49 AM, Doug Cutting wrote: >>=20 >>> On 06/10/2010 09:27 AM, Scott Carey wrote: >>>> I propose: >>>> * We use getters/setters, and allow setters and constructors to be >>>> package protected with an option so the objects can be safely passed >>>> around outside the package without stateful wrapper objects. Users >>>> can write factory methods or other static helper methods to abstract >>>> away as much as they would like from there. >>>> * Setters should not allow invalid data to be set. >>>=20 >>> I'd be okay with these changes. +1 >>>=20 >>>> * Provide some built-in mechanism to resolve unions on a read and >>>> verify them on a write. Perhaps a inner static enum for each union, >>>> and a getter for each branch. >>>=20 >>> So a schema like: >>>=20 >>> {"type":"record", "name":"Foo", "fields":[ >>> {"name":"bar", "type":["int", "float"]}]} >>>=20 >>> might generate something like: >>>=20 >>> public class Foo { >>> public enum BarType { INT, FLOAT } >>> private BarType bar$Type; >>> private Object bar; >>> public void setBar(int value) { bar =3D value; bar$Type =3D INT; } >>> public void setBar(float value) { bar =3D value; bar$Type =3D FLOAT; = } >>> public int getBar$Int() { >>> if (bar$Type !=3D INT) throw new ...; >>> return (int)bar; >>> } >>> public float getBar$Float() { >>> if (bar$Type !=3D FLOAT) throw new ...; >>> return (float)bar; >>> } >>> } >>>=20 >>=20 >> I was trying to think of a way to storing the type in there to keep the = memory footprint minimized. >> The getter name disambiguation needs a good format, and the $ delimiter = above would work well and be clear to users. >> However, I was thinking something more along these lines: >>=20 >> {code} >> public class Foo { >> // intrinsic simple types could be shared rather than in each class >> // named types must be per union. >> public enum BarType { >> INT(Integer.class), >> FLOAT(Float.class); >> private Class type; // internal use only >> BarType(Class type) { >> this.type =3D type; >> } >> } >> private Object bar; >> // Integer vs int: avoid auto-unbox, auto-rebox if the user has an Int= eger already >> public void setBar(Integer value) { bar =3D value; } >> // same, Float vs float >> public void setBar(Float value) { bar =3D value; } >> public Integer getBar$Int() { >> // we could return null instead of throwing. Return Integer, client= can un-box if needed >> if (bar.getClass() =3D=3D INT.type) return bar; else throw new ...; >> } >> public Float getBar$Float() { >> if (bar.getClass() =3D=3D FLOAT.type) return bar; else throw new ...= ; >> } >> public BarType getBar$Type() { >> // we could use a static ConcurrentHashMap, BarType> inst= ead of cascaded if/else if/ ... >> Class c =3D bar.getClass(); >> if (INT.type =3D=3D c) { >> return INT; >> } else if (FLOAT.type =3D=3D c) { >> return FLOAT; >> } else { >> throw new NeverHappensException(); // :) >> } >> } >> {code} >>=20 >> I think the above is thread-safe with no race conditions, though if one = thread is calling the setter toggling the type another thread calling getBa= r$Type might see a 'delayed' type. >>=20 >> Since the ENUM is exposed to the client, it also needs a regular naming = pattern so that schema evolution doesn't change the method names. >>=20 >> A user accessing an enum can then switch on the enum and call the approp= riate method. >>=20 >>=20 >>> Is that the sort of thing you have in mind? I've added dollar signs to >>> avoid potential conflicts, e.g., if the user had a field named 'barType= ' >>> or 'barInt'. I think its best to always add the dollars, since >>> otherwise it could get awkward if a field named 'barInt' were added lat= er. >>>=20 >>>> * Unions with one type and NULL should not translate to Object, but >>>> rather be that type and allow null. >>>=20 >>> That's already the case, no? >>>=20 >>> Also, you didn't provide a proposal for constructors, yet, right? >>=20 >> Yes, I'll need a more complicated example to discuss that one appropriat= ely. A pattern that aids in business object<> Avro translation, especiall= y in the presence of class hierarchies and a couple different object compos= ition patterns. A Union as a representation of a subclass is tricky. So is= it when modeling composition. >>=20 >>> Perhaps all of the setters could be on a generated builder class, with >>> an instance only returned when it's completely populated? >>=20 >> Providing builder classes rather than constructors could be extremely us= eful to aid in encapsulation. It should address my constructor concern and= remain flexible. >>=20 >>=20 >>>=20 >>> Doug >>=20