hadoop-general mailing list archives

From George Porter <George.Por...@Sun.COM>
Subject Re: [PROPOSAL] new subproject: Avro
Date Fri, 03 Apr 2009 20:24:05 GMT

On Apr 3, 2009, at 1:02 PM, Doug Cutting wrote:

> George Porter wrote:
>> While this representation would certainly be as compact as  
>> possible, wouldn't it prevent evolving the data structure over  
>> time?  One of the nice features of Google Protocol Buffers and  
>> Thrift is that you can evolve the set of fields over time, and  
>> older/newer clients can talk to older/newer services.  If the  
>> proposed Avro is evolvable, then perhaps I'm misunderstanding your  
>> statement about the lack of IDs in the serialized data.
> Avro supports schema evolution.  In Avro, the schema used to write  
> the data must be available when the data is read.  (In files, it is  
> typically stored in the file metadata.)
> If you have the schema that was used to write the data, and you're  
> expecting a slightly different schema, then you simply keep those  
> fields that are in both schemas and skip those that are not.  This is  
> equivalent to Thrift's and Protocol Buffers' support for schema  
> evolution, but does not require manually assigning numeric field ids.
> This feature can also be used to support projection.  If you have  
> records with many large fields, but only need a single field in a  
> particular computation, then you can specify an expected schema with  
> only that field, and the runtime will efficiently skip all of the  
> other fields, returning a record with just the single, expected field.
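
To make the mechanism quoted above concrete, here is a minimal sketch in Python of resolving a writer's schema against an expected (reader's) schema. It is not Avro's actual API or wire format: the list-of-pairs schemas, the fixed-width and length-prefixed encoding, and the names write_record and read_record are all assumptions made up for illustration. The point it tries to show is that the serialized data carries no field ids, and that the reader keeps only the fields named in both schemas while seeking past the rest.

```python
import io
import struct

# Hypothetical schemas: ordered lists of (field name, type) pairs.
# Real Avro schemas are JSON documents; this is only a stand-in.
WRITER_SCHEMA = [("id", "long"), ("name", "string"), ("payload", "string")]
READER_SCHEMA = [("name", "string"), ("id", "long")]  # fewer fields, other order


def write_record(out, schema, record):
    """Write values in writer-schema order, with no field ids or names."""
    for name, ftype in schema:
        if ftype == "long":
            out.write(struct.pack(">q", record[name]))
        elif ftype == "string":
            data = record[name].encode("utf-8")
            out.write(struct.pack(">I", len(data)))  # length prefix
            out.write(data)
        else:
            raise ValueError(f"unsupported type: {ftype}")


def read_record(inp, writer_schema, reader_schema):
    """Walk the writer's schema, keeping only fields the reader expects."""
    wanted = {name for name, _ in reader_schema}
    result = {}
    for name, ftype in writer_schema:
        if ftype == "long":
            if name in wanted:
                (result[name],) = struct.unpack(">q", inp.read(8))
            else:
                inp.seek(8, io.SEEK_CUR)  # skip a fixed-width value
        elif ftype == "string":
            (length,) = struct.unpack(">I", inp.read(4))
            if name in wanted:
                result[name] = inp.read(length).decode("utf-8")
            else:
                inp.seek(length, io.SEEK_CUR)  # skip without decoding
        else:
            raise ValueError(f"unsupported type: {ftype}")
    return result


buf = io.BytesIO()
write_record(buf, WRITER_SCHEMA, {"id": 7, "name": "avro", "payload": "x" * 1000})

# Schema evolution: the reader expects a slightly different schema.
buf.seek(0)
record = read_record(buf, WRITER_SCHEMA, READER_SCHEMA)
print(record)  # {'id': 7, 'name': 'avro'}

# Projection: an expected schema with a single field.
buf.seek(0)
projected = read_record(buf, WRITER_SCHEMA, [("name", "string")])
print(projected)  # {'name': 'avro'}
```

In this sketch the 1000-byte payload is never read or decoded in either call; the reader only consumes its 4-byte length prefix and seeks past the rest. That depends on the writer's schema being available at read time, as the quoted text requires.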

Thanks for the clarification--I better understand the schema  
relationship now.  Projection is a nice feature, especially since it  
seems like it would be able to support "sparse files," where you want  
to just peek at large structs without incurring a lot of disk I/O  
(for data serialized on disk).

>> I also agree with Bryan, in that it would be unfortunate to have  
>> two different Apache projects with overlapping goals.
> We already have both Thrift and Etch in the incubator, which have  
> similar goals.  Apache does not attempt to mandate that projects  
> have disjoint goals.  There are many ways to slice things, and  
> Apache prefers to rely on survival of the fittest rather than  
> forcing things together.
>> Regardless of features, both protocol buffers and thrift have the  
>> advantage of being debugged in mission-critical production  
>> environments.
> Yes, but, as I've argued in other messages in this thread, they do  
> not support the dynamic features we need.  Adding those features  
> would add new code that would share little with existing code in  
> those projects. So, while the projects are conceptually similar, the  
> implementations are necessarily different, and, without significant  
> code overlap, separate projects seem more natural.
> Doug

Makes sense.  Thanks,
