hadoop-general mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: [PROPOSAL] new subproject: Avro
Date Fri, 03 Apr 2009 20:02:08 GMT
George Porter wrote:
> While this representation would certainly be as compact as possible, 
> wouldn't it prevent evolving the data structure over time?  One of the 
> nice features of Google Protocol Buffers and Thrift is that you can 
> evolve the set of fields over time, and older/newer clients can talk to 
> older/newer services.  If the proposed Avro is evolvable, then perhaps 
> I'm misunderstanding your statement about the lack of IDs in the 
> serialized data.

Avro supports schema evolution.  In Avro, the schema used to write the 
data must be available when the data is read.  (In files, it is 
typically stored in the file metadata.)

If you have the schema that was used to write the data, and you're 
expecting a slightly different schema, then you simply keep those fields 
that are in both schemas and skip those that are not.  This is equivalent 
to the schema evolution support in Thrift and Protocol Buffers, but does 
not require manually assigning numeric field ids.
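
To make the matching rule concrete, here is a minimal sketch in Python. 
It is a toy model, not Avro's API: a schema is assumed to be just a list 
of field names, a record is a list of values in the writer's field order 
(so no ids are stored with the data), and resolve() is a hypothetical 
helper name.

```python
def resolve(writer_fields, reader_fields, values):
    """Keep the fields present in both schemas, matched by name."""
    # The writer's schema tells us which name each positional value has.
    written = dict(zip(writer_fields, values))
    # Fields only the writer knows are skipped; fields only the reader
    # expects are simply absent from the result in this toy model.
    return {name: written[name] for name in reader_fields if name in written}


writer_fields = ["id", "name", "email"]   # schema the data was written with
reader_fields = ["id", "name", "phone"]   # slightly different expected schema
row = [7, "alice", "alice@example.org"]   # values only, no field ids

resolved = resolve(writer_fields, reader_fields, row)
print(resolved)  # {'id': 7, 'name': 'alice'}
```

The point of the sketch is that field names from the writer's schema do 
the job that numeric ids do elsewhere, which is why the writer's schema 
has to travel with the data.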

This feature can also be used to support projection.  If you have 
records with many large fields, but only need a single field in a 
particular computation, then you can specify an expected schema with 
only that field, and the runtime will efficiently skip all of the other 
fields, returning a record with just the single, expected field.
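
A rough sketch of why skipping can be cheap, again in Python. The wire 
format here is an assumption made for illustration (every field is a 
UTF-8 string with a 4-byte length prefix), not Avro's actual encoding, 
and encode() and read_projected() are hypothetical names.

```python
import struct


def encode(values):
    """Write each string as a 4-byte big-endian length plus its bytes."""
    out = b""
    for value in values:
        data = value.encode("utf-8")
        out += struct.pack(">I", len(data)) + data
    return out


def read_projected(buf, writer_fields, wanted):
    """Walk the writer's fields in order, decoding only the wanted ones."""
    record, offset = {}, 0
    for name in writer_fields:
        (length,) = struct.unpack_from(">I", buf, offset)
        offset += 4
        if name in wanted:
            record[name] = buf[offset:offset + length].decode("utf-8")
        # Unwanted fields are skipped by advancing the offset, without
        # decoding or copying their bytes.
        offset += length
    return record


writer_fields = ["title", "body", "author"]
buf = encode(["Avro", "x" * 100000, "doug"])
projected = read_projected(buf, writer_fields, {"author"})
print(projected)  # {'author': 'doug'}
```

The large "body" field costs one length read and an offset bump; only 
the expected field is actually materialized.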

> I also agree with Bryan, in that it would be unfortunate to have two 
> different Apache projects with overlapping goals.

We already have both Thrift and Etch in the incubator, which have 
similar goals.  Apache does not attempt to mandate that projects have 
disjoint goals.  There are many ways to slice things, and Apache prefers 
to rely on survival of the fittest rather than forcing things together.

> Regardless of 
> features, both protocol buffers and thrift have the advantage of being 
> debugged in mission-critical production environments.

Yes, but, as I've argued in other messages in this thread, they do not 
support the dynamic features we need.  Adding those features would add 
new code that would share little with existing code in those projects. 
So, while the projects are conceptually similar, the implementations are 
necessarily different, and, without significant code overlap, separate 
projects seem more natural.

