hadoop-general mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: [PROPOSAL] new subproject: Avro
Date Fri, 03 Apr 2009 17:28:03 GMT
Bryan Duxbury wrote:
> It sounds like what you want is the option to avoid pre-generated classes.

That's part of it.  But, once you have the schema, you might as well 
take advantage of it.

With the schema in hand, you don't need to tag data with field numbers 
or types, since all of that is already in the schema.  So, having the 
schema, you can use a simpler data format.
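
As a rough illustration of that point, the sketch below encodes the same record twice: once with a field number and type code in front of every value, and once with bare values in schema order. The schema, the type codes, and both byte layouts are hypothetical and chosen only for the comparison; they are not Avro's or Thrift's actual wire formats.

```python
import struct

# Hypothetical schema: ordered (name, type) pairs known to writer and reader.
SCHEMA = [("id", "long"), ("score", "double")]
FORMATS = {"long": "<q", "double": "<d"}   # 8-byte little-endian values
TYPE_CODES = {"long": 1, "double": 2}      # made-up type tags

def encode_tagged(record):
    """Self-describing layout: field number and type code precede each value."""
    out = b""
    for number, (name, ftype) in enumerate(SCHEMA, start=1):
        out += struct.pack("<HB", number, TYPE_CODES[ftype])  # 3-byte tag
        out += struct.pack(FORMATS[ftype], record[name])
    return out

def encode_untagged(record):
    """Schema-driven layout: values only, written in schema order."""
    return b"".join(struct.pack(FORMATS[ftype], record[name])
                    for name, ftype in SCHEMA)

def decode_untagged(data):
    """The reader walks the schema to know what each run of bytes means."""
    record, offset = {}, 0
    for name, ftype in SCHEMA:
        (record[name],) = struct.unpack_from(FORMATS[ftype], data, offset)
        offset += struct.calcsize(FORMATS[ftype])
    return record
```

For a record such as `{"id": 7, "score": 1.5}`, the tagged form spends three extra bytes per field, while the untagged form can only be read by someone holding the schema.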

Also, with the schema, resolving version differences is simplified. 
Developers don't need to assign field numbers; they can just use names, 
as in most software.  For performance, an implementation can still use 
field numbers internally while reading, to avoid string comparisons, but 
developers no longer need to specify them.  Here having the schema means 
we can simplify the IDL and its versioning semantics.
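
One way to picture this is a reader that matches the writer's field names against its own once, up front, and then works purely with positions. The sketch below is a minimal version of that idea; `build_plan` and `read_record` are hypothetical helper names, and real schema resolution would also have to handle defaults, type promotion, and nested types.

```python
def build_plan(writer_fields, reader_fields):
    """Match fields by name once.

    Returns, for each writer position, the reader position that receives
    the value, or None if the reader does not know the field.
    """
    reader_index = {name: i for i, name in enumerate(reader_fields)}
    return [reader_index.get(name) for name in writer_fields]

def read_record(values, plan, reader_width, default=None):
    """Place decoded values using positions only; no string comparisons here."""
    record = [default] * reader_width
    for value, target in zip(values, plan):
        if target is not None:      # fields unknown to the reader are skipped
            record[target] = value
    return record

# Writer and reader disagree on field order and on which fields exist.
writer_fields = ["id", "name", "legacy"]
reader_fields = ["name", "id", "email"]
plan = build_plan(writer_fields, reader_fields)
row = read_record([7, "ann", "x"], plan, len(reader_fields))
```

The name lookup cost is paid once per pair of schemas, not once per record.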

> If that's the only thing you need, it seems like we could bolt that on 
> to Thrift with almost no work.

Would you write parsers for Thrift's IDL in every language?  Or would 
you use JSON, as Avro does, to avoid that?
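
To make the parser argument concrete: when the schema is plain JSON, each language's existing JSON library does the parsing. The sketch below uses only Python's standard library; the `User` record and its fields are hypothetical, written in the general style of a JSON schema, and `field_types` is an illustrative helper and not part of any Avro API.

```python
import json

# Hypothetical record schema expressed as JSON text.
SCHEMA_JSON = """
{"type": "record", "name": "User",
 "fields": [{"name": "id",   "type": "long"},
            {"name": "name", "type": "string"}]}
"""

def field_types(schema_text):
    """Parse the schema with the stock JSON parser and map field name to type."""
    schema = json.loads(schema_text)
    return {field["name"]: field["type"] for field in schema["fields"]}
```

No grammar, lexer, or generated parser is involved, which is the cost being weighed against writing an IDL parser per language.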

Once you're using a different IDL and a different data format, what's 
shared with Thrift?  Fundamentally, those two things define a 
serialization system, no?

> I assume you'd have the schema stored in 
> metadata or file header or something, right? (You wouldn't want to store 
> the field names in the binary encoding as strings, since that would 
> probably very quickly dwarf the size of the actual data in a lot of cases.)

Yes, in data files the schema is typically stored in the metadata.
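
A minimal sketch of that arrangement follows: the schema is written once as a length-prefixed JSON header, and every record after it is just packed values. The file layout and the `fmt` key (a `struct` format character per field) are assumptions made for this example and do not describe Avro's actual data file format.

```python
import io
import json
import struct

def write_file(stream, schema, records):
    """Write the schema once in a header, then untagged records."""
    header = json.dumps(schema).encode("utf-8")
    stream.write(struct.pack("<I", len(header)))   # 4-byte header length
    stream.write(header)
    fmt = "<" + "".join(f["fmt"] for f in schema["fields"])
    for rec in records:
        values = [rec[f["name"]] for f in schema["fields"]]
        stream.write(struct.pack(fmt, *values))

def read_file(stream):
    """Recover the schema from the header and use it to decode the records."""
    (header_len,) = struct.unpack("<I", stream.read(4))
    schema = json.loads(stream.read(header_len).decode("utf-8"))
    fmt = "<" + "".join(f["fmt"] for f in schema["fields"])
    names = [f["name"] for f in schema["fields"]]
    size = struct.calcsize(fmt)
    records = []
    while True:
        chunk = stream.read(size)
        if len(chunk) < size:       # end of stream
            break
        records.append(dict(zip(names, struct.unpack(fmt, chunk))))
    return schema, records

# Hypothetical schema: "q" is a 64-bit integer, "d" a 64-bit float.
POINT_SCHEMA = {"name": "Point",
                "fields": [{"name": "x", "fmt": "q"},
                           {"name": "y", "fmt": "d"}]}
```

Field names appear once per file in the header, so their size is amortized over all records and never repeated in the record bodies.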

> If my assumptions are correct, it seems like it'd be a lot smarter to 
> leverage existing Thrift infrastructure and encoding work rather than 
> duplicating it for this lone feature.

What specific shared infrastructure would be leveraged?  For Hadoop's 
RPC, I hope to adapt Hadoop's client and server implementations as a 
transport, as these have been highly tuned for Hadoop's performance 

