avro-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Carey <sc...@richrelevance.com>
Subject Re: schema defaults not reflected in generated objects (1.3.2)
Date Thu, 10 Jun 2010 22:44:18 GMT

On Jun 10, 2010, at 2:57 PM, Doug Cutting wrote:

> Re boxing: Good idea to minimize it.
> Also, we don't need to store a class in the enum.  At code generation 
> time we can simply write, e.g., Integer.class wherever you put INT.type.
  Good point, this is generated code, if we want to replace Integer.class with AvroSuperInteger.class
it would be hidden from the API.

> Re storing type: this is a time/space tradeoff.  If you store the union 
> type in each record then you can write instances faster, since you can 
> use type.ordinal(), and also dispatch switch statements for the value 
> faster in user code.

I guess we'll have to see what the time/space tradeoff looks like with some testing later.

For writing, its still a switch on the enum without the persisted type, something like

BarType bt = getBar$Type();
switch(bt) {
case INT:
 writeInt(getBar$Int()) ...
case FLOAT:
 writeFloat(getBar$Float()) ...

The getBar$Type() requires the cascaded if/else though.  And the get has a conditional on
getClass().  The latter should be very fast, the former less so.
This could be inverted to make the datum class be responsible for the writing too:

// note, thread safe
writeBarUnion(Encoder e) {
  Object bar = this.bar;
  Class<?> c = bar.getClass();
  BarType bt = getBar$Type(c); // private helper
  switch(bt) {
  case INT:
   writeInt((int)bar); break;
  case FLOAT:
   writeFloat((float)bar); break;

It might even make sense to just go all the way and add a 

writeTo(Encoder e);

Though that has some implications for lots of other stuff -- anything using a DatumWriter.
 It does open op opportunities for optimizations and simplification of union resolution without
traversing the Schema.

>  A hash table lookup is probably not a lot faster 
> than an if/then/elseif/... unless the union is really big.  Personally 
> I'd opt for performance over slightly smaller instances here.  I suppose 
> we could offer both options, since it shouldn't alter the public API.

Sometimes memory size is performance.  It depends on whether you're streaming these in (size
isn't a big deal) or holding on to them for a long time.

I'd rather solidify the API though, and keep thinking through all the possibilities.  As long
as all these options are available with one API then minor implementation detail options like
this can be dealt with later.

One other thing that would be very useful longer term would be to unify the Specific and Reflect
APIs.  What I mean by that, is that the Specific API should generate classes in such a way
that the Reflect API would serialize the objects to the same schema if asked to reflect on
the object.  This probably means simply decorating the Specific objects with annotations.

> Doug
> On 06/10/2010 02:09 PM, Scott Carey wrote:
>> On Jun 10, 2010, at 11:49 AM, Doug Cutting wrote:
>>> On 06/10/2010 09:27 AM, Scott Carey wrote:
>>>> I propose:
>>>> * We use getters/setters, and allow setters and constructors to be
>>>> package protected with an option so the objects can be safely passed
>>>> around outside the package without stateful wrapper objects.  Users
>>>> can write factory methods or other static helper methods to abstract
>>>> away as much as they would like from there.
>>>> * Setters should not allow invalid data to be set.
>>> I'd be okay with these changes.  +1
>>>> * Provide some built-in mechanism to resolve unions on a read and
>>>> verify them on a write.  Perhaps a inner static enum for each union,
>>>> and a getter for each branch.
>>> So a schema like:
>>> {"type":"record", "name":"Foo", "fields":[
>>>   {"name":"bar", "type":["int", "float"]}]}
>>> might generate something like:
>>> public class Foo {
>>>   public enum BarType { INT, FLOAT }
>>>   private BarType bar$Type;
>>>   private Object bar;
>>>   public void setBar(int value) { bar = value; bar$Type = INT; }
>>>   public void setBar(float value) { bar = value; bar$Type = FLOAT; }
>>>   public int getBar$Int() {
>>>     if (bar$Type != INT) throw new ...;
>>>     return (int)bar;
>>>   }
>>>   public float getBar$Float() {
>>>     if (bar$Type != FLOAT) throw new ...;
>>>     return (float)bar;
>>>   }
>>> }
>> I was trying to think of a way to storing the type in there to keep the memory footprint
>> The getter name disambiguation needs a good format, and the $ delimiter above would
work well and be clear to users.
>> However, I was thinking something more along these lines:
>> {code}
>> public class Foo {
>>   // intrinsic simple types could be shared rather than in each class
>>   // named types must be per union.
>>   public enum BarType {
>>     INT(Integer.class),
>>     FLOAT(Float.class);
>>     private Class<?>  type; // internal use only
>>     BarType(Class<?>  type) {
>>       this.type = type;
>>     }
>>   }
>>   private Object bar;
>>   // Integer vs int: avoid auto-unbox, auto-rebox if the user has an Integer already
>>   public void setBar(Integer value) { bar = value; }
>>   // same, Float vs float
>>   public void setBar(Float value) { bar = value; }
>>   public Integer getBar$Int() {
>>     // we could return null instead of throwing.  Return Integer, client can un-box
if needed
>>     if (bar.getClass() == INT.type) return bar; else throw new ...;
>>   }
>>   public Float getBar$Float() {
>>     if (bar.getClass() == FLOAT.type) return bar; else throw new ...;
>>   }
>>   public BarType getBar$Type() {
>>     // we could use a static ConcurrentHashMap<Class<?>, BarType>   instead
of cascaded if/else if/ ...
>>     Class<?>  c = bar.getClass();
>>     if (INT.type == c) {
>> 	return INT;
>>     } else if (FLOAT.type == c) {
>>         return FLOAT;
>>     } else {
>>       throw new NeverHappensException();  // :)
>>     }
>> }
>> {code}
>> I think the above is thread-safe with no race conditions, though if one thread is
calling the setter toggling the type another thread calling getBar$Type might see a 'delayed'
>> Since the ENUM is exposed to the client, it also needs a regular naming pattern so
that schema evolution doesn't change the method names.
>> A user accessing an enum can then switch on the enum and call the appropriate method.
>>> Is that the sort of thing you have in mind?  I've added dollar signs to
>>> avoid potential conflicts, e.g., if the user had a field named 'barType'
>>> or 'barInt'.  I think its best to always add the dollars, since
>>> otherwise it could get awkward if a field named 'barInt' were added later.
>>>> * Unions with one type and NULL should not translate to Object, but
>>>> rather be that type and allow null.
>>> That's already the case, no?
>>> Also, you didn't provide a proposal for constructors, yet, right?
>> Yes, I'll need a more complicated example to discuss that one appropriately.  A pattern
that aids in business object<>  Avro translation, especially in the presence of class
hierarchies and a couple different object composition patterns. A Union as a representation
of a subclass is tricky.  So is it when modeling composition.
>>> Perhaps all of the setters could be on a generated builder class, with
>>> an instance only returned when it's completely populated?
>> Providing builder classes rather than constructors could be extremely useful to aid
in encapsulation.  It should address my constructor concern and remain flexible.
>>> Doug

View raw message