hadoop-general mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: [PROPOSAL] new subproject: Avro
Date Mon, 06 Apr 2009 19:12:32 GMT
Chad Walters wrote:
> -- You suggest that there is not a lot in Thrift that Avro can
> leverage. I think you may be overlooking the fact that Thrift has a
> user base and a community of developers who are very interested in
> issues of cross-language data serialization and interoperability.

I meant that in terms of common code, not coders.  Coders can belong to 
more than one community but code should generally not.  Hadoop Core has 
become a sprawling community that we're trying to split.  It's more 
productive to have more, smaller communities than a few large ones.  A 
project needs a handful of active developers, but too many and it 
becomes ungainly.  So, if it's technically possible for a codebase to be 
distinct, and it can attract enough active developers to sustain itself, 
that is a preferable structure.

> At the code level, Thrift contains a transport abstraction and
> multiple different transport and server implementations in many
> different target languages. If there were closer collaboration, Avro
> could certainly benefit from leveraging the existing ones and any
> additional contributions in this area would benefit both projects.

The transport and server implementations are indeed an area where code 
could potentially be shared between Avro and Thrift.  Perhaps someone 
could start a separate project with reusable transport and server 
implementations to support RPC?  In any case, Avro primarily specifies a 
binary message format, not a full transport.  We hope to piggyback off 
other transport implementations, like HTTP servers, etc.  Full 
transports involve authentication, authorization, encryption, etc., 
which are outside of the scope of Avro.

> The most significant issue is that both of them specify a type
> system. At a very minimum I would like to see Avro and Thrift make
> agreements on that type system.

This makes good sense.  It would be good if these were interoperable.

Thrift has byte and i16, which Avro currently lacks.  I'd like to add 
a fixed<n> primitive type to Avro, where n is the number of bytes and is 
specified in the schema, so that one could, e.g., define a byte as 
fixed<1>, i16 as a fixed<2> and md5 as a fixed<16>.
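
To make the fixed<n> idea concrete, here is a minimal sketch in Python.  It 
is only an illustration: the Fixed class, its validate method, and the 
big-endian packing of the i16 are assumptions made for this example, not 
anything Avro defines.

```python
import hashlib
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixed:
    """Hypothetical fixed<n> type: a value is exactly `size` raw bytes."""

    size: int

    def validate(self, value: bytes) -> bytes:
        # The schema carries the length, so a value is checked against it
        # rather than carrying a length prefix of its own.
        if len(value) != self.size:
            raise ValueError(
                f"fixed<{self.size}> needs {self.size} bytes, got {len(value)}"
            )
        return value


# The three examples from the paragraph above.
BYTE = Fixed(1)
I16 = Fixed(2)
MD5 = Fixed(16)

# Byte order is an assumption of this sketch; the proposal does not fix one.
i16_value = I16.validate(struct.pack(">h", -2))
md5_value = MD5.validate(hashlib.md5(b"avro").digest())
```

The point of the sketch is that byte, i16 and md5 need no separate 
primitives: each is the same type with a different n taken from the schema.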

Thrift has both lists and sets; Avro has just arrays, which are 
equivalent to lists (they're ordered).  Perhaps Avro could add sets. 
Are they leveraged heavily in Thrift?  I've not heard much call for them 
in Avro yet.

Avro has single-float, Thrift does not.  Avro could perhaps lose this.

Avro distinguishes UTF-8 text strings from byte strings, while Thrift 
does not.  I am reluctant to lose this distinction.

Avro has unions and a null type, while Thrift does not.  Does Thrift 
support recursive data structures?

> Furthermore, you say that last part ("Thrift would have yet another
> serialization format...") like it is a bad thing... 

When faced with multiple programming and scripting languages, multiple 
serialization formats should be discouraged, or one ends up with 
multiplicative compatibility problems.  A single, primary data format 
would vastly simplify the Hadoop ecosystem.  Yes, folks need to be able 
to easily import and export data, but expecting scripts in arbitrary 
languages to be able to process data in arbitrary formats seems unwise.

> Note that it is
> an explicit design goal of Thrift to allow for multiple different
> serialization formats so that lots of different use cases can be
> supported by the same fundamental framework.

That's not a design goal of Avro, which intends to provide a single, 
well-specified, easy-to-implement serialization format.  This is not in 
conflict with Thrift; it's just a different goal.

> Also, doesn't Avro essentially contain "another serialization format
> that every language would need to implement for it to be useful"?
> Seems like the same basic set of work to me, whether it is in Avro or
> Thrift.

None of Thrift's existing formats solves the problems Avro seeks to solve. 
Thrift may be able to incorporate Avro's format, if it has good format 
generalizations, ideally using Avro's code.  So there should be little 
duplication of effort in such an approach.

> The simplification comes simply not having the field IDs in the IDL?
> I am not sure why having sequential id numbers after each field is
> considered to be so onerous.

I didn't say it was onerous, I said that, like in most data structure 
languages (e.g., programming languages), Avro permits folks to name 
fields with symbolic names alone.  In human-authored software, symbolic 
naming is generally preferable to numeric naming.  Is that really a 
matter of dispute?

> If the field IDs are really so
> objectionable, Thrift could allow them to be optional for purely
> dynamic usages.

Optional features increase compatibility complexity and are harder to 
maintain and test.  A Thrift IDL without numbers would not provide 
versioning features to non-dynamic languages.

> I also don't see why matching names is considered easier than
> matching numbers, which is essentially what the versioning semantics
> come down to in the end. Am I missing something here?

They are formally equivalent.  For machines, matching numbers is easier, 
but people usually prefer to operate on names, and names can be 
automatically mapped to numbers.
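
A minimal sketch of that last point, in Python.  The function names and the 
sample field lists are hypothetical, and this is not a description of how 
Avro actually resolves schemas; it only shows that numbering can be derived 
mechanically from names.

```python
def number_fields(names):
    """Map each symbolic field name to its position in the declaration."""
    ids = {}
    for position, name in enumerate(names):
        if name in ids:
            raise ValueError(f"duplicate field name: {name}")
        ids[name] = position
    return ids


def match_by_name(writer_names, reader_names):
    """For each reader field, give the writer's position, or None if absent."""
    writer_ids = number_fields(writer_names)
    return {name: writer_ids.get(name) for name in reader_names}


# Hypothetical schemas: the reader reorders fields and adds one.
writer = ["id", "name", "email"]
reader = ["name", "id", "phone"]
mapping = match_by_name(writer, reader)
# mapping == {"name": 1, "id": 0, "phone": None}
```

People write and compare the names; the machine does the lookup by number 
once the mapping has been computed.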

> Consider an alternative: making Avro more like a sub-project of
> Thrift or just implementing it directly in Thrift.

I looked into changing Thrift to support Avro's features, and it was 
very messy.  Perhaps someone else could do this more easily.

Building Avro as a part of Thrift would take considerably more effort 
for me and I think offer little more than it does separately.  If you 
feel differently, you are free to fork Avro, start a competitor, provide 
patches that integrate it into Thrift, or whatever.

> In that case, I
> think the end result will be a powerful and flexible "one-stop shop"
> for data serialization for RPC and archival purposes with the ability
> to bring both static and dynamic capabilities as needed for
> particular application purposes. To me this seems like a bigger win
> for both Hadoop and for Thrift.

It could be a floor wax and a dessert topping!

Doug

