avro-dev mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: sync interval for AvroOutputFormat
Date Sun, 19 Dec 2010 23:14:22 GMT

On Dec 18, 2010, at 1:05 PM, Joe Crobak wrote:

> AvroOutputFormat supports setting deflate level, but not the sync interval.
> Was this a conscious decision (i.e. would there be drawbacks of making the
> sync interval larger)?
> 
> In some tests that I've done, Avro data files were over 50% smaller when I
> upped the sync interval to 2MB (default is 16000 bytes).  I also saw a
> modest speedup in building the files (I suspect my program was IO-bound).
> 
> Would folks support a patch to add setting a sync interval as a static
> configuration option to AvroOutputFormat?

Yes, it makes sense to expose that.

Out of curiosity, how much of an improvement do you get from going to 64000 bytes?  A larger
default for the MapReduce case makes sense, but 2MB may be on the large side.  M/R can only
split the file at sync boundaries, and you don't want those to end up too far from the HDFS
block boundaries.
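
To put a rough number on "too far", here is a back-of-the-envelope sketch. It assumes that a
split aligned to an HDFS block keeps reading until the first sync marker past its end, so it
may pull up to about one sync interval of data from the next block. The 64MB HDFS block size
and the class and method names are illustrative assumptions, not anything in Avro, and real
blocks can run somewhat over the configured interval.

```java
public class SyncOverrun {
    /**
     * Rough upper bound, as a fraction of one HDFS block, on how much data a
     * split may have to read past its own block boundary. Assumes the overrun
     * is at most about one sync interval.
     */
    public static double overrunFraction(long syncIntervalBytes, long hdfsBlockBytes) {
        return (double) syncIntervalBytes / hdfsBlockBytes;
    }

    public static void main(String[] args) {
        long hdfsBlock = 64L * 1024 * 1024; // assumed HDFS block size
        long[] intervals = {16_000, 64_000, 2L * 1024 * 1024};
        for (long interval : intervals) {
            System.out.printf("sync interval %,d bytes -> up to %.3f%% of a block%n",
                    interval, 100.0 * overrunFraction(interval, hdfsBlock));
        }
    }
}
```

Under these assumptions a 2MB interval is 1/32 of a 64MB block, so the cost is bounded but
no longer negligible, while 64000 bytes stays well under a tenth of a percent.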

The file format default is moderately sized because for many non-M/R use cases, syncing to
disk more regularly is a good idea.  With deflate's 32k lookback window, compression
ratio as a function of block size tends to have a sharp elbow near that size.  In my experiments,
compression ratio stopped improving once blocks reached about 120k in size, and even there it
was only moderately better than with 16000 byte blocks.  But my data isn't your data.
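
One way to check where the elbow sits for a given data set is to deflate the same bytes in
independent chunks of different sizes and compare the totals. The sketch below does that with
java.util.zip.Deflater only. It starts a fresh compressor per chunk, on the assumption that
this approximates how each block in a data file is compressed on its own. The class name,
the synthetic records, and the list of sizes are all made up for illustration, and the
numbers it prints say nothing about real Avro files.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class BlockSizeRatio {
    /** Deflates data in independent chunks of blockSize bytes and returns the total compressed size. */
    public static long compressedSize(byte[] data, int blockSize, int level) {
        byte[] out = new byte[64 * 1024];
        long total = 0;
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            // New compressor per chunk, so no history carries across chunk boundaries.
            Deflater deflater = new Deflater(level, true);
            deflater.setInput(data, off, len);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(out);
            }
            deflater.end();
        }
        return total;
    }

    /** Builds deterministic, record-like text as stand-in input (hypothetical data). */
    public static byte[] sampleData(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("user=").append(i % 997)
              .append(",item=").append((i * 31) % 10007)
              .append(",event=view\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(200_000);
        int[] sizes = {16_000, 32_000, 64_000, 128_000, 2 * 1024 * 1024};
        for (int size : sizes) {
            long compressed = compressedSize(data, size, Deflater.DEFAULT_COMPRESSION);
            System.out.printf("block %,9d bytes -> ratio %.2f%n",
                    size, (double) data.length / compressed);
        }
    }
}
```

Swapping sampleData for bytes read from a real file is the useful version of this test,
since the shape of the curve depends heavily on the data.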
> 
> Best,
> Joe

