hadoop-general mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: [PROPOSAL] new subproject: Avro
Date Fri, 03 Apr 2009 19:49:29 GMT
On 4/3/09 12:03 PM, "George Porter" <George.Porter@Sun.COM> wrote:

> On Apr 3, 2009, at 11:37 AM, Doug Cutting wrote:
>> Field ids are not present in Avro data except in the schema.  A
>> record's fields are serialized in the order that the fields occur in
>> the record's schema, with no per-field annotations whatsoever.  For
>> example, a record that contains a string and an int is serialized
>> simply as a string followed by an int, nothing before, nothing
>> between and nothing after. So, yes, it is a different data format.
> While this representation would certainly be as compact as possible,
> wouldn't it prevent evolving the data structure over time?  One of the
> nice features of Google Protocol Buffers and Thrift is that you can
> evolve the set of fields over time, and older/newer clients can talk
> to older/newer services.  If the proposed Avro is evolvable, then
> perhaps I'm misunderstanding your statement about the lack of IDs in
> the serialized data.
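Doug's description of the wire format can be made concrete with a short sketch of Avro's binary encoding rules (zig-zag varints for ints/longs, length-prefixed UTF-8 for strings), written in plain Python so no avro library is needed. The record fields here are illustrative, not from any real schema:

```python
# Sketch of Avro's binary encoding rules (no avro library required):
# ints/longs are zig-zag base-128 varints; strings are a long length
# followed by UTF-8 bytes. A record is just its fields' encodings
# concatenated in schema order -- no tags, ids, or separators.

def encode_long(n: int) -> bytes:
    """Zig-zag, then base-128 varint, as the Avro spec defines."""
    n = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes stay small
    out = bytearray()
    while (n & ~0x7F) != 0:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# A hypothetical record {"name": "avro", "id": 1} with schema
# (string name, int id) serializes field-by-field, nothing between:
payload = encode_string("avro") + encode_long(1)
print(payload.hex())  # 086176726f02 -> zig-zag length 4, "avro", then int 1
```

Note there is nothing in `payload` identifying which field is which; the reader must already know the schema, which is exactly why Avro carries the schema in the stream header rather than tagging each field.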

From a quick perusal of the serialization format -- it contains headers with
type/schema information and other metadata blocks.  The types can be
inferred from these, and if this is done right then older/newer clients will
be able to read things just fine.  What can't be done is mixing two
different formats in the same stream, since the headers define the format of
the whole stream.

I have not looked much deeper than that, but it looks like schema evolution
is feasible.
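The resolution step that makes this evolution work can be sketched as follows: the reader decodes bytes using the *writer's* schema (taken from the stream header), then maps the decoded fields onto its own schema, filling in defaults for fields the writer didn't know about. This is a simplified illustration of the idea, not Avro's actual API; all field names and defaults are hypothetical:

```python
# Sketch of Avro-style schema resolution. The reader always decodes with
# the writer's schema (carried in the stream header), then projects the
# result onto its own, possibly newer, schema.

writer_schema = ["name", "id"]               # old writer: two fields
reader_schema = {"name": None, "id": None,   # None = no default, required
                 "email": "unknown"}         # new field added with a default

def resolve(decoded_fields: dict, reader: dict) -> dict:
    """Project writer-decoded fields onto the reader's schema."""
    record = {}
    for field, default in reader.items():
        if field in decoded_fields:
            record[field] = decoded_fields[field]  # present in both schemas
        elif default is not None:
            record[field] = default                # new field: use default
        else:
            raise ValueError(f"no value or default for {field!r}")
    return record

# Bytes written under the old schema decode in writer-schema order:
decoded = dict(zip(writer_schema, ["avro", 1]))
print(resolve(decoded, reader_schema))
# {'name': 'avro', 'id': 1, 'email': 'unknown'}
```

Fields the writer wrote but the reader's schema dropped would simply be skipped during decoding, which is the other half of the evolution story.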

> I also agree with Bryan, in that it would be unfortunate to have two
> different Apache projects with overlapping goals.  Regardless of
> features, both protocol buffers and thrift have the advantage of being
> debugged in mission-critical production environments.
> -George
