avro-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: user experience
Date Wed, 02 Sep 2009 17:05:46 GMT
Scott Carey wrote:
> Decoding 3MB/sec seems rather slow to me (121MB log file instantiated to
> objects in ~40 secs).  For comparison, creating tuple objects from a Hadoop
> SequenceFile is ~5x faster.  Granted I'm comparing apples to oranges (my
> objects in SequenceFile to Eelco's test in Avro).
> This would depend a lot on the objects themselves, the schema, and
> generic vs. specific, etc.

FWIW, in microbenchmarks, accessing fields via reflection is around 100x 
slower than normal field access!  That makes the reflect API generally 
much slower than generic and specific.
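
A rough way to see this kind of gap for yourself is sketched below. It is not Avro code: `Point` and `FieldAccessBench` are hypothetical names made up for illustration, and the only library used is `java.lang.reflect`. Timings depend heavily on the JVM and on what the JIT does with the direct loop, so treat any ratio it prints as indicative only.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a record class with one public field.
class Point {
    public long x;
}

class FieldAccessBench {
    // Sum the field n times using ordinary field access.
    static long sumDirect(Point p, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += p.x;
        }
        return total;
    }

    // Sum the same field n times, reading it through java.lang.reflect.Field.
    static long sumReflect(Point p, int n) {
        try {
            Field f = Point.class.getField("x");
            long total = 0;
            for (int i = 0; i < n; i++) {
                total += f.getLong(p);
            }
            return total;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Point p = new Point();
        p.x = 3;
        int n = 10_000_000;

        // Warm up both paths so the JIT has compiled them before timing.
        sumDirect(p, n);
        sumReflect(p, n);

        long t0 = System.nanoTime();
        long direct = sumDirect(p, n);
        long t1 = System.nanoTime();
        long reflect = sumReflect(p, n);
        long t2 = System.nanoTime();

        System.out.println("direct:  " + (t1 - t0) / 1_000_000 + " ms, sum=" + direct);
        System.out.println("reflect: " + (t2 - t1) / 1_000_000 + " ms, sum=" + reflect);
    }
}
```

Both methods return the same sum, so the only difference being measured is how the field is read.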

Reflect is also a bit tricky to use, since you need to define classes 
whose fields Avro knows how to serialize: the reflect API cannot infer 
an Avro schema for every Java class, but rather only for a stylized 
subset of classes (which needs to be better documented, AVRO-35).

I've found that generating classes with the specific API is both simpler 
and faster.  In particular, if you have a set of related classes, use a 
method-free protocol file (.avpr) to define them.  The Java classes are 
generated by an Ant task.  For example, see the patch I attached to the 
following issue:


The "schemata" Ant target generates a file under build/src named 
Events.java that contains nested classes for each type defined in 
Events.avpr. (That target would better be named "generate-avro-classes".)

Note that specific's generated code does not currently have constructors 
or accessor methods.  Instead all fields are public, so, to build an 
instance you create it with something like 'Foo foo = new Foo();' then 
set all its fields with things like 'foo.a = ...;'.  If this proves too 
cumbersome, we could generate a constructor that includes all fields.  I 
don't see a big need for accessor methods: a public setter and getter are 
equivalent to a public field.  The only advantage accessors would add is 
if you might someday wish to replace the class with a non-Avro-generated 
implementation, change the fields, keep the accessor methods and 
serialize it manually or with reflection.  This does not seem like a 
likely scenario to me, and it's nice to keep the generated code small.
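
To make the public-field style concrete, here is a minimal sketch. `Foo` below is a hand-written stand-in, not actual generated output, and its fields `a` and `b` are assumptions chosen for illustration. `FooExample.build` is a hypothetical helper showing how application code could get the convenience of an all-fields constructor without changing the generated class.

```java
// Hand-written stand-in for a class generated by the specific API:
// no constructors beyond the default one, no accessors, only public fields.
class Foo {
    public int a;
    public String b;
}

class FooExample {
    // Hypothetical helper playing the role an all-fields constructor would:
    // create the instance, then assign each public field.
    static Foo build(int a, String b) {
        Foo foo = new Foo();
        foo.a = a;
        foo.b = b;
        return foo;
    }

    public static void main(String[] args) {
        Foo foo = build(42, "hello");
        // Reads are plain field accesses as well.
        System.out.println(foo.a + " " + foo.b);
    }
}
```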

The primary downside of using the specific API is that you can't add 
extra methods, etc. to the generated classes.  You need to treat them 
just as dumb structs, and keep all application logic external to them. 
In practice I don't think this is a big limitation, however.
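
One way to live with that restriction is to put behaviour in a separate utility class that takes the struct as an argument. The sketch below assumes a hypothetical `LogEvent` record with `timestamp` and `message` fields, and `LogEvents` is likewise an invented name. Neither comes from Events.avpr.

```java
// Hand-written stand-in for a generated "dumb struct".
class LogEvent {
    public long timestamp;
    public String message;
}

// Application logic lives outside the generated class, as static helpers.
final class LogEvents {
    private LogEvents() {}

    // True if the event happened strictly before the given cutoff.
    static boolean isBefore(LogEvent e, long cutoffMillis) {
        return e.timestamp < cutoffMillis;
    }

    // Human-readable one-line summary of the event.
    static String describe(LogEvent e) {
        return e.timestamp + ": " + e.message;
    }

    public static void main(String[] args) {
        LogEvent e = new LogEvent();
        e.timestamp = 5L;
        e.message = "hi";
        System.out.println(describe(e));
        System.out.println(isBefore(e, 10L));
    }
}
```

Because the helpers depend only on the public fields, regenerating the struct does not overwrite any hand-written code.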

