avro-user mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: file format stable?
Date Fri, 12 Mar 2010 18:44:04 GMT
On Mar 12, 2010, at 9:35 AM, Tim Sell wrote:

> excellent! thanks for the response :)
> 
I have committed a large dataset to using the current format.  The current format will not
be abandoned.

The current format has its limitations.  It is optimized for large numbers of small records
(roughly under 2K each), and probably should not be used for records significantly larger than 1MB.
Essentially, it is built for the more typical Hadoop processing use case as well as structured
data storage.

The main drawbacks are:
* Synchronous logging -- the file is written in block-sized chunks.  To commit a record to
disk as soon as possible, each record has to be its own block, which is inefficient.
* Large records -- blocks are read in as a whole, and currently need to fit in memory in some
implementations (including Java).  We could relax this requirement for some compression codecs.
* Large records -- the final block size has to be known before writing; currently this is
done by buffering the block in memory while writing.
* One schema -- each file has one schema for all records within.  This is a very good simplification
for most needs, but one cannot merge or concatenate two files with different schemas, even
for the most minor schema difference. 
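
To make the first and third drawbacks concrete, here is a minimal Python sketch of a
block-buffered writer.  It is a toy, not the real Avro encoding: the names BlockWriter and
written_size, the fixed-width integer header and the 64K default block size are all
assumptions made for illustration.  It only shows why a block is buffered in memory before
it is written, and what it costs when every record becomes its own block.

```python
import io
import os
import struct

SYNC_SIZE = 16  # assumption: a fixed-size marker separates blocks


class BlockWriter:
    """Toy block-buffered writer (hypothetical; NOT the real Avro encoding)."""

    def __init__(self, out, block_size=64 * 1024):
        self.out = out
        self.block_size = block_size
        self.sync = os.urandom(SYNC_SIZE)
        self.buf = io.BytesIO()
        self.count = 0

    def append(self, record: bytes, sync_now: bool = False):
        # Records accumulate in memory because the block header needs the
        # final record count and byte size before the block can be written.
        self.buf.write(struct.pack(">I", len(record)))
        self.buf.write(record)
        self.count += 1
        if sync_now or self.buf.tell() >= self.block_size:
            self.flush()

    def flush(self):
        if self.count == 0:
            return
        data = self.buf.getvalue()
        # Header (record count, byte size), then the data, then the marker.
        self.out.write(struct.pack(">II", self.count, len(data)))
        self.out.write(data)
        self.out.write(self.sync)
        self.buf = io.BytesIO()
        self.count = 0


def written_size(records, sync_each):
    """Return the total bytes written for the given records."""
    out = io.BytesIO()
    writer = BlockWriter(out)
    for record in records:
        writer.append(record, sync_now=sync_each)
    writer.flush()
    return len(out.getvalue())


if __name__ == "__main__":
    records = [b"x" * 100] * 1000
    print("batched blocks:  ", written_size(records, sync_each=False))
    print("block per record:", written_size(records, sync_each=True))
```

In this sketch the per-block overhead is a fixed header plus the marker, so writing one
block per record pays that overhead once per record instead of once per batch.  The real
format's overhead differs in detail, but the trade-off has the same shape.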

Use cases that push the boundaries above may require a new and different file format, or perhaps
some sort of extension to the current format.

-Scott

> On 12 March 2010 17:30, Doug Cutting <cutting@apache.org> wrote:
>> Tim Sell wrote:
>>> 
>>> But we're wondering if the file format is set in stone now
>> 
>> It should not change again.  It did not seem that any were yet using the
>> prior format, and it had some bad limitations, so we revised it.  If it ever
>> does change again, we would require implementations to be back-compatible,
>> still able to read the old format.
>> 
>> Doug
>> 

