hadoop-general mailing list archives

From Chad Walters <chad_walt...@yahoo.com>
Subject RE: [PROPOSAL] new subproject: Avro
Date Mon, 06 Apr 2009 07:23:18 GMT

Cross-posting to the Thrift dev and user lists since folks there may be interested in this.
It appears that my attempts to subscribe to general@hadoop.apache.org from my work email were
silently failing somewhere along the line -- I'll try not to take it personally. ;) Some others
have experienced this too -- so if you didn't get a subscription confirmation message, then
it failed. Try from a different address, I guess. You can view the thread here without being
subscribed: http://mail-archives.apache.org/mod_mbox/hadoop-general/200904.mbox/browser

Doug,

First, let me say that I think Avro has a lot of useful features -- features that I would
like to see fully supported in Thrift. At a minimum, I would like for us to be able to hash
out the details to guarantee that there can really be full interoperability between Avro and
Thrift. I am really interested in working cooperatively and collaboratively on this and I
am willing to put in significant time on design and communication to help make full interoperability
possible (I am unfortunately not able to contribute code directly at this time).

Second, I think the decision about where Avro should live requires more thought and
more discussion. I'd love to hear from more folks outside of Yahoo on this topic: so far all
of the +1 votes have come from Yahoo employees. I'd also love to hear from other folks who
have significant investments in both Thrift and Hadoop.

Some points to think about:

-- You suggest that there is not a lot in Thrift that Avro can leverage. I think you may be
overlooking the fact that Thrift has a user base and a community of developers who are very
interested in issues of cross-language data serialization and interoperability. Thrift has
committers with expertise in a pretty big set of languages, and leveraging this could bring Avro's
functionality to more languages faster than the current path will. Also, there is in fact significant
overlap between Hadoop users and Thrift users at this point, as well as significant use of
Thrift in more than one Hadoop sub-project.

At the code level, Thrift contains a transport abstraction and multiple different transport
and server implementations in many different target languages. If there were closer collaboration,
Avro could certainly benefit from leveraging the existing ones and any additional contributions
in this area would benefit both projects.

-- You also suggest that the two are largely disjoint from a technical perspective:
"Thrift fundamentally standardizes an API, not a data format.
Avro fundamentally is a data format specification, like XML."

I agree on the fundamental distinction, but I don't think that framing brings to light enough
of what the two have in common, and where they differ, for the purposes of this discussion.

Thrift specifies a type system, an API for data formats and transport mechanisms, a schema
resolution algorithm, and provides implementations of several distinct data formats and transports.

Avro specifies a single data format, but it also brings along several other things, including
a type system, a specific RPC mechanism, and a schema resolution algorithm.
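
To make the overlap concrete, here is roughly the same record sketched in both IDLs (a hand-written illustration; the field names are mine):

    Thrift IDL:

        struct LogEntry {
          1: required string host,
          2: required i64 timestamp
        }

    Avro JSON schema:

        {"type": "record", "name": "LogEntry",
         "fields": [{"name": "host",      "type": "string"},
                    {"name": "timestamp", "type": "long"}]}

The types line up nearly one-to-one (string/string, i64/long); the visible deltas are the field IDs and the spelling of the type names. That is exactly the sort of thing the two projects could agree on.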

The most significant issue is that both of them specify a type system. At a very minimum I
would like to see Avro and Thrift make agreements on that type system. The fact that there
is significant existing investment in the Thrift type system by the Thrift community should
weigh somewhere in this discussion. Obviously, the technical needs of Avro will also have
weight there, but where there is room for choice, the Thrift choices should be respected.
Arbitrary changes here will make it unnecessarily painful, perhaps impossible, for Thrift
to adopt Avro directly; instead, Thrift would be forced to create an "Avro-like" data specification,
hampering interoperability for everyone.

There may be pitfalls in the other areas of overlap as well that would prevent real interoperability
-- let's elucidate them in further discussions.

-- Avro appears to have 3 primary features that Thrift does not currently support sufficiently:
1. Schema serialization allowing for compact representation of files containing large numbers
of records of identical types
2. Dynamic interpretation of schemas, which improves ease-of-use in dynamic languages (like
the Python Hadoop Streaming use case)
3. Lazy partial deserialization to support "projection"

Note that features 1 and 3 are independent of whether schemas are dynamically interpreted or
compiled into static bindings.

WRT #1: Thrift's DenseProtocol goes some distance towards this although it doesn't go the
whole way. Thrift can easily be extended to further compact the DenseProtocol's wire format
for special cases where all fields are required. We have previously had significant discussions
on the Thrift list about doing more in this area, but we couldn't get the folks from Hadoop
who cared most about this use case to participate in capturing a complete set of requirements,
so there was no strong driver for it.
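
To sketch what that special case buys (illustrative Python, not DenseProtocol itself): when both sides already share the schema and every field is required, values can be written back-to-back in schema order, with no per-field type bytes or IDs at all.

    import struct

    # Sketch only: writer and reader both know the schema
    # (host: string, timestamp: i64), so the wire format is just
    # the values in schema order -- no field headers of any kind.

    def write_entry(host, timestamp):
        data = host.encode("utf-8")
        return struct.pack(">i", len(data)) + data + struct.pack(">q", timestamp)

    def read_entry(buf):
        (n,) = struct.unpack_from(">i", buf, 0)
        host = buf[4:4 + n].decode("utf-8")
        (ts,) = struct.unpack_from(">q", buf, 4 + n)
        return host, ts

    assert read_entry(write_entry("node42", 1239001398)) == ("node42", 1239001398)

Multiply that per-record saving across a file of millions of identical records and the appeal for the Hadoop use case is obvious.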

WRT #2: I totally understand the case you make for dynamic interpretation in ad hoc data processing.
I would love to see Thrift enhanced to do this kind of thing.
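
For instance (a toy sketch, not an existing Thrift or Avro API), the parsed schema itself can drive a generic decoder, so a script gets plain dicts back without any generated bindings:

    import json
    import struct

    def read_string(buf, off):
        (n,) = struct.unpack_from(">i", buf, off)
        off += 4
        return buf[off:off + n].decode("utf-8"), off + n

    def read_long(buf, off):
        (v,) = struct.unpack_from(">q", buf, off)
        return v, off + 8

    DECODERS = {"string": read_string, "long": read_long}

    def read_record(schema, buf, off=0):
        # The schema, not generated code, decides what to decode next.
        rec = {}
        for f in schema["fields"]:
            rec[f["name"]], off = DECODERS[f["type"]](buf, off)
        return rec, off

    schema = json.loads('{"type": "record", "name": "LogEntry", '
                        '"fields": [{"name": "host", "type": "string"}, '
                        '{"name": "timestamp", "type": "long"}]}')

That is the shape of thing a Python streaming job wants: hand it a schema and a byte stream, get dicts back.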

WRT #3: Partial deserialization seems like a really useful feature for several use cases,
not just for "projection". I think Thrift could and should be extended to support this functionality,
and it should be available for both static bindings and dynamic schema interpretation via
field names and field IDs where possible.
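
As a sketch of the skipping half of this (hypothetical helpers over a Thrift-style layout where each field is preceded by a type byte and an i16 field ID; the numeric type codes are Thrift's TType values):

    import struct

    FIXED = {2: 1, 8: 4, 10: 8}        # TType BOOL, I32, I64 byte widths

    def field_end(buf, off, ftype):
        if ftype == 11:                # TType STRING: 4-byte length prefix
            (n,) = struct.unpack_from(">i", buf, off)
            return off + 4 + n
        return off + FIXED[ftype]

    def project(buf, wanted_ids):
        # Keep raw bytes for the wanted fields (decode them lazily
        # later); everything else is skipped, never deserialized.
        out, off = {}, 0
        while off < len(buf):
            ftype, fid = struct.unpack_from(">bh", buf, off)
            off += 3
            end = field_end(buf, off, ftype)
            if fid in wanted_ids:
                out[fid] = buf[off:end]
            off = end
        return out

The same walk works whether the caller is generated code or a dynamic interpreter, keyed by field ID directly or by a name-to-ID lookup from the schema.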

-- You state:
"Perhaps Thrift could be augmented to support Avro's JSON schemas and 
serialization.  Then it could interoperate with other Avro-based 
systems.  But then Thrift would have yet another serialization format, 
that every language would need to implement for it to be useful..."

First, that "Perhaps" hides a lot of complexity and unless that is hashed out ahead of time
I am pretty sure the real answer will be "Thrift cannot be augmented to support Avro directly
but instead could be augmented to support something that looks quite a bit like Avro but differs
in mostly unimportant ways." To me that seems like a shame.

Furthermore, you say that last part ("Thrift would have yet another serialization format...")
like it is a bad thing... Note that it is an explicit design goal of Thrift to allow for multiple
different serialization formats so that lots of different use cases can be supported by the
same fundamental framework. This is a clear recognition that there is no one-size-fits-all
answer for data serialization (fast RPC vs compact archival record data vs human readability,
to name a few salient use cases). For a compelling enough use case, there is no reason not
to port new protocols across multiple languages (generally done on an as-needed basis by someone
who wants that functionality in that language). Another great feature of the protocol abstraction
is that it allows data to be seamlessly moved from one serialization format to another as,
say, it is read out of archival storage and sent on as RPC.
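
In Thrift's Python bindings that conversion is a few lines; assuming a generated struct class LogEntry (from an IDL like the one sketched above), it looks something like:

    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol, TJSONProtocol
    # from logs.ttypes import LogEntry   # assumed: generated from the IDL

    def binary_to_json(binary_bytes):
        # Read the record in the compact binary wire format...
        in_buf = TTransport.TMemoryBuffer(binary_bytes)
        entry = LogEntry()
        entry.read(TBinaryProtocol.TBinaryProtocol(in_buf))

        # ...and re-emit the same struct through a different protocol.
        out_buf = TTransport.TMemoryBuffer()
        entry.write(TJSONProtocol.TJSONProtocol(out_buf))
        return out_buf.getvalue()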

Also, doesn't Avro essentially contain "another serialization format that every language would
need to implement for it to be useful"? Seems like the same basic set of work to me, whether
it is in Avro or Thrift.

-- You state:
"Avro fundamentally is a data format specification, like XML.  Thrift could implement this
specification.  The Avro project includes reference implementations, but the format is intended
to be simple enough and the specification stable enough that others might reasonably develop
alternate, independent implementations."

I think this is a bit inaccurate. First, there is the type system compatibility issue I raised
above, and the question of whether that "could" is plausible without refinement of, and collaboration
on, Avro's specification. Furthermore, the stated goal of the subproject is "for Avro to replace
both Hadoop's RPC and to be used for most Hadoop data files". This will bring in quite a bit
beyond a reference implementation of a data format specification, especially depending on
how many languages you intend to build RPC support for (Java, Python, and C++ have all been
mentioned at some point -- others?). I don't think it is unreasonable that a significant proportion
of the folks in the Hadoop community who are also using Thrift are puzzled about why more
consideration isn't being given to convergence between Avro and Thrift.

-- You state:
"Also, with the schema, resolving version differences is simplified. 
Developers don't need to assign field numbers, but can just use names. 
For performance, one can internally use field numbers while reading, to 
avoid string comparisons, but developers need no longer specify these, 
but can use names, as in most software.  Here having the schema means we 
can simplify the IDL and its versioning semantics."

Does the simplification come simply from not having the field IDs in the IDL? I am not sure
why assigning sequential ID numbers to each field is considered to be so onerous; I have honestly
never heard a single Thrift user complain about it. Anyone doing more than that is doing
something advanced that wouldn't be possible without the field IDs, like renaming a field
(see the sketch after this paragraph).
I think having to deal with JSON syntax in the Avro IDL is actually more annoying for humans
than the application of field IDs, both with the added syntactic punctuation and the increased
verbosity. If the field IDs are really so objectionable, Thrift could allow them to be optional
for purely dynamic usages.
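
To illustrate why the IDs earn their keep (illustrative IDL):

    // v1
    struct LogEntry {
      1: required string host,
      2: required i64 timestamp
    }

    // v2: field 1 renamed from "host" to "hostname". Because the
    // wire format carries the ID rather than the name, v1 and v2
    // readers and writers still interoperate; a name-keyed format
    // would see this rename as a delete plus an add.
    struct LogEntry {
      1: required string hostname,
      2: required i64 timestamp
    }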

I also don't see why matching names is considered easier than matching numbers, which is essentially
what the versioning semantics come down to in the end. Am I missing something here?

-- You state:
"Would you write parsers for Thrift's IDL in every language?  Or would 
you use JSON, as Avro does, to avoid that?"

Here I totally agree with you: a JSON IDL is better for machine parsing than Thrift's current
IDL, which is targeted more at human parsing. And given that I agree that some form of dynamic
interpretation is a useful feature, I don't see any reason why a JSON version of the IDL couldn't
become part of the picture. Furthermore, the Thrift IDL compiler could easily be extended
to take this JSON format as both an input (in addition to the current Thrift IDL) and output.
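
To be concrete about what I mean (a purely hypothetical rendering -- no such JSON form of the Thrift IDL exists today):

    {"struct": "LogEntry",
     "fields": [{"id": 1, "req": "required", "type": "string", "name": "host"},
                {"id": 2, "req": "required", "type": "i64",    "name": "timestamp"}]}

The compiler already builds essentially this parse tree internally, so emitting and accepting it as JSON should be mostly plumbing.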

An alternative would be to have the other languages bind to the Thrift IDL parser directly
-- most languages can bind to C (granted, for some it is easier than for others) -- and get
back the parsed data structure to interpret from.

-- By making Avro a sub-project of Hadoop, I believe you will succeed in producing an improved
version of Hadoop Record IO and a better RPC mechanism than the current Hadoop RPC. However,
I don't think this will result in a better general-purpose RPC than Thrift, and it will certainly
be much less performant for RPC in a wide range of applications.

Consider an alternative: making Avro more like a sub-project of Thrift or just implementing
it directly in Thrift. In that case, I think the end result will be a powerful and flexible
"one-stop shop" for data serialization for RPC and archival purposes with the ability to bring
both static and dynamic capabilities as needed for particular application purposes. To me
this seems like a bigger win for both Hadoop and for Thrift.

Thanks for reading through to this point. I look forward to further discussion.

Chad
