avro-dev mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: (de)serialization comparisons
Date Sun, 28 Nov 2010 08:02:56 GMT
Well, the code looks very inefficient and is not a good test of how Avro is meant to be used.

My hunch is that the test is not measuring serialization / deserialization; it's measuring
the cost of instantiating and initializing various objects.  Many of these are very expensive
since they parse the string schema -- much like a regular expression in Java, where you want to
compile the expression once and run it many times, in Avro you want to initialize the DatumWriters
and DatumReaders for a Schema once and use them many times.

In the benchmark I recommend using System.nanoTime(), since System.currentTimeMillis() is not accurate
(it's essentially +- 7ms on most OSs).
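A minimal sketch of the timing harness (runBenchmark() is a hypothetical stand-in for the loop under test):

long start = System.nanoTime();
runBenchmark(); // hypothetical: the serialization loop being measured
long elapsedNanos = System.nanoTime() - start;
System.out.printf("elapsed: %.3f ms%n", elapsedNanos / 1e6);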

The encoding loop in the patch looks like:

+        for (int j = 0; j < mainLoop; ++j) {
+        	SpecificDatumWriter writer = new SpecificDatumWriter(ARangeSliceCommand.SCHEMA$);
+        	ByteArrayOutputStream bout = new ByteArrayOutputStream();
+        	BinaryEncoder enc = new BinaryEncoder(bout);
+        	for (ARangeSliceCommand cmd : acmds)
+        	{
+        		enc.writeString(new Utf8(cmd.getSchema().toString()));
+        		writer.write(cmd, enc);
+        	}
+            bout.close();
+            buf = bout.toByteArray();
+            writer = null;
+            bout = null;
+            enc = null;


Honestly, without looking too far, I think this is benchmarking the toString() method of Schema.
I think all objects in acmds are the same, with the same schema, so that whole Utf8 instantiation
can move out of that loop and be done once.  Furthermore, why is the schema being written for
each record?  Avro is designed to avoid that -- store it once somewhere and then write multiple
records with that schema.
The patch doesn't make it easy to see how big 'mainLoop' is compared to the size of 'acmds';
if 'mainLoop' is large and 'acmds' is small, then datum writer creation is measured heavily too.
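For contrast, here is roughly what a corrected encoding loop could look like (a sketch using the same classes as the patch; the writer, stream, and encoder are created once, and the schema is written once):

SpecificDatumWriter<ARangeSliceCommand> writer =
    new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
ByteArrayOutputStream bout = new ByteArrayOutputStream();
BinaryEncoder enc = new BinaryEncoder(bout);
enc.writeString(new Utf8(ARangeSliceCommand.SCHEMA$.toString())); // schema once, not per record
for (int j = 0; j < mainLoop; ++j) {
  for (ARangeSliceCommand cmd : acmds) {
    writer.write(cmd, enc); // the only per-record work
  }
}
enc.flush();
byte[] buf = bout.toByteArray();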

You need to cache the SpecificDatumWriter and SpecificDatumReader and re-use them; these are
meant to be re-used and kept per record type / schema.
Caching the BinaryEncoder and BinaryDecoder will help too if 'mainLoop' is not small.  "new
BinaryDecoder()" is deprecated; use the DecoderFactory and ideally re-use the instance when
possible.
A ThreadLocal WeakHashMap (or WeakIdentityHashMap) of Class -> SpecificDatumWriter would
work if you are concerned about thread safety or the object lifetimes.
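A minimal sketch of that caching idea (the class and method names here are illustrative, not an existing Avro utility):

import java.util.Map;
import java.util.WeakHashMap;
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificDatumWriter;

// Hypothetical per-thread cache: one SpecificDatumWriter per generated record class.
final class WriterCache {
  private static final ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>> WRITERS =
      new ThreadLocal<Map<Class<?>, SpecificDatumWriter<?>>>() {
        @Override protected Map<Class<?>, SpecificDatumWriter<?>> initialValue() {
          // Weak keys let a cached writer be collected if its record class goes away.
          return new WeakHashMap<Class<?>, SpecificDatumWriter<?>>();
        }
      };

  @SuppressWarnings("unchecked")
  static <T> SpecificDatumWriter<T> writerFor(Class<T> type, Schema schema) {
    Map<Class<?>, SpecificDatumWriter<?>> cache = WRITERS.get();
    SpecificDatumWriter<?> w = cache.get(type);
    if (w == null) {
      w = new SpecificDatumWriter<T>(schema); // schema parsed/initialized once per thread
      cache.put(type, w);
    }
    return (SpecificDatumWriter<T>) w;
  }
}

The benchmark would then call WriterCache.writerFor(ARangeSliceCommand.class, ARangeSliceCommand.SCHEMA$) instead of constructing a new writer inside the loop.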

The decoder loop is:
+        for (int j = 0; j < mainLoop; ++j) {
+        	BinaryDecoder dec = new BinaryDecoder(new ByteArrayInputStream(buf));
+        	SpecificDatumReader<ARangeSliceCommand> reader = new SpecificDatumReader<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
+        	for (int i = 0; i < acmds.size(); i++)
+        	{
+        		reader.setSchema(Schema.parse(dec.readString(new Utf8()).toString()));
+        		reader.read(new ARangeSliceCommand(), dec);
+        	}

First of all, dec.readString() happily takes a null: either pass in a Utf8 you intend to
reuse -- one created outside of the loop, for example -- or pass in null.

Next, this test is testing Schema.parse() performance.  This should be outside the loop.
The schema of the written data should be kept once, not stored once per record.  Storing
the schema with each record is an Avro anti-pattern.  If you are storing a schema with
every serialized record, you are either using Avro wrong or have a use case that is not suitable
for Avro at all.  Avro has a verbose JSON schema and a very compact binary record representation.
When a record type is serialized repeatedly, the schema need only be persisted once, and
the records with unique data persisted compactly.  There are many ways to do this, and they differ
depending on the use case.  Think about an RDBMS -- the equivalent of a table definition is
not stored with each row.  Storing an Avro schema with each Avro record is like storing a
table definition with each row.  If you have changing schemas over time, you need to store
only one schema per change somewhere, and have a mechanism to know which groups of records
have which schemas.  The canonical Avro example is the Avro Data File -- all records in it
have the same schema and the schema is stored in the header of the file.
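For instance, with DataFileWriter from org.apache.avro.file (a minimal sketch; the file name is illustrative):

SpecificDatumWriter<ARangeSliceCommand> datumWriter =
    new SpecificDatumWriter<ARangeSliceCommand>(ARangeSliceCommand.SCHEMA$);
DataFileWriter<ARangeSliceCommand> fileWriter =
    new DataFileWriter<ARangeSliceCommand>(datumWriter);
// The schema goes into the file header exactly once; every appended
// record is just the compact binary encoding.
fileWriter.create(ARangeSliceCommand.SCHEMA$, new File("commands.avro"));
for (ARangeSliceCommand cmd : acmds) {
  fileWriter.append(cmd);
}
fileWriter.close();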

The decoder bit should look more like:

BinaryDecoder dec = DecoderFactory.defaultFactory().createBinaryDecoder(buf, null);
SpecificDatumReader<ARangeSliceCommand> reader =
    threadLocalReaders.get(ARangeSliceCommand.class); // a Class -> SpecificDatumReader mapping
reader.setSchema(Schema.parse(dec.readString(null).toString())); // schema stored once
ARangeSliceCommand record = new ARangeSliceCommand();
for (int j = 0; j < mainLoop; ++j) {
  for (int i = 0; i < acmds.size(); i++) {
    reader.read(record, dec); // or pass in null if there is no object to reuse; reusing is faster
  }
}


And that would be very fast, and additionally keep very few bytes per record on average, assuming
mainLoop*acmds.size() is large (at least a thousand).


On Nov 27, 2010, at 10:59 AM, Eric Evans wrote:

> https://issues.apache.org/jira/browse/CASSANDRA-1765

