hadoop-general mailing list archives

From Chad Walters <chad_walt...@yahoo.com>
Subject Re: [PROPOSAL] new subproject: Avro
Date Tue, 07 Apr 2009 08:56:59 GMT

Doug,

> I have never said I was not interested in working together.

That's great -- glad to hear that you are open to collaboration. My concern, however, is that
making Avro a separate (sub)project may make it difficult for us to work together in practice,
and in particular may make it difficult for Thrift to leverage Avro's source code.

> I've said that I think Avro is fundamentally different from Thrift.
> Avro is a specific format, Thrift is a generic API for various formats, none
> like Avro.  They might be made to work together.  But at this point I
> see no point in forcing them together.

I don't think that they are as far apart as you are making it sound with this statement. I
do think, however, that it will be very difficult for them to work together properly if
code reuse by Thrift is not an explicit goal of Avro. The easiest way I can come up
with to guarantee this is simply to incorporate Avro's feature set into Thrift. If you have
other mechanisms for doing this, I'd love to hear them.

> If TProtocol's API is a good match for Avro's format and features,
> then it should be easy for folks to implement TProtocol using Avro's code and
> include Avro in Thrift.  If the match is not good then perhaps we can adjust
> Thrift and/or Avro to improve it.

Absolutely. And right now, there are sufficient differences in the type system and other areas
that do require some adjustments, likely some on both sides (although, as I said in my previous
email, we need to account for the fact that Thrift has current users to support so backwards-compatibility
will need to be a consideration).

> Communities form around code, and, if Avro's code is largely disjoint
> from Thrift's, we should not assume that everyone in the Thrift
> community cares about Avro or vice versa.

IMO communities form around shared goals and purposes. Code and designs are created to achieve
those purposes; they are also malleable and can be bent to achieve new goals and purposes.
If we can find common cause, then we form a common community.

You have some features that you want to satisfy for Hadoop's purposes: compact serialization
of large files containing many records of identical structure; partial deserialization in
support of projection; dynamic interpretation of object schemas; better/more efficient RPC
-- all delivered across multiple languages. The first three are also use cases that are of
interest to some portion of the Thrift community and the fourth is something that Thrift already
provides.

Avro at this point is fairly nascent -- you have a design, some code, a couple of developers,
and a target group of future users who seem very receptive to what you are working on. You
do not have current users, however, and that should mean that you have some degree of flexibility
to your design where it doesn't make a material difference to the use cases you are trying
to solve.

If you are willing to make some modifications to that design and
code, the work on Avro could also work directly towards extending
Thrift's functionality. I am pretty certain that the Thrift community
would be willing to make some reasonable modifications and extensions
to Thrift to smooth the way for this as well.

I think that by working closely with the Thrift community directly in the Thrift code base,
you will get several significant benefits. You will be able to directly leverage the transport
and server implementations in Thrift today, and you will benefit from any future work in this area as well.
You will have a built-in set of developers and committers across many languages who are already
familiar with issues in cross-language serialization (and I agree with Kevin that this is
not as portable as you seem to think it is). You will be able to avoid writing lots of parts
of an RPC framework in multiple languages that you would need to write to make Avro a stand-alone
solution for Hadoop. You would have a significant role in shaping the direction of Thrift
to make sure that it remains a strong solution for Hadoop.

> I've said that I think Avro is fundamentally different from Thrift.  Avro
> is a specific format, Thrift is a generic API for various formats, none like Avro.

It is clear to me that a slightly modified version of Avro's data format should fit just fine
as a Thrift TProtocol implementation. Out of the box this would, of course, only provide for
statically generated bindings, but this is enough to satisfy the first of the desired features
I described above.

The second feature, partial deserialization, is a feature that I would like to see in Thrift
for a variety of use cases, not just your projection use case -- for example, message routing
where only a message header is deserialized to determine where to pass along an otherwise
uninterpreted block of data. This feature is not tightly coupled to the Avro data format in
any way. As you have stated, this is possible to do when you have the schema in hand. Note
that the static bindings in Thrift are another way that the schema can be transmitted -- in
fact, the whole schema could just be retrievable from the bindings directly and fed into whatever
mechanism is available for dynamic interpretation. But we wouldn't have to go so far as that
for field lookup by name -- as Kevin pointed out, the Java and Ruby Thrift libraries already
have mechanisms for sufficient introspection to accomplish the right kind of lookups, I believe,
and the other libraries could be extended to do the same quite easily. So partial deserialization
can be supported via dynamic interpretation and/or via introspection features of the static bindings.
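
To make the "schema in hand" case concrete, here is a minimal Python sketch of schema-driven projection. It is only an illustration: the schema representation, the two field types, and the fixed-width/length-prefixed wire layout are all made up for this example and are not Avro's or Thrift's actual encodings. The point is just that, with the schema available, a reader can skip over fields it does not want without decoding them.

```python
import struct

# Hypothetical toy schema: ordered (field name, type) pairs. Fields are
# written in schema order with no per-field tags, so the schema is needed
# to find field boundaries.
SCHEMA = [("id", "int"), ("body", "string"), ("dest", "string")]


def encode(schema, record):
    """Serialize a dict according to the toy schema."""
    out = bytearray()
    for name, ftype in schema:
        if ftype == "int":
            out += struct.pack(">i", record[name])        # 4-byte signed int
        elif ftype == "string":
            raw = record[name].encode("utf-8")
            out += struct.pack(">I", len(raw)) + raw      # length prefix + bytes
        else:
            raise ValueError("unknown type: %s" % ftype)
    return bytes(out)


def project(schema, data, wanted):
    """Decode only the fields named in `wanted`; skip over the rest."""
    result = {}
    offset = 0
    for name, ftype in schema:
        if ftype == "int":
            if name in wanted:
                (result[name],) = struct.unpack_from(">i", data, offset)
            offset += 4
        elif ftype == "string":
            (length,) = struct.unpack_from(">I", data, offset)
            offset += 4
            if name in wanted:
                result[name] = data[offset:offset + length].decode("utf-8")
            offset += length                              # skip without decoding
        else:
            raise ValueError("unknown type: %s" % ftype)
    return result
```

For the routing case, a router would call something like `project(SCHEMA, data, {"dest"})` and forward `data` unchanged, never decoding the (possibly large) `body` field.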

To support the third use case, dynamic schema interpretation, there is definitely significant
new code to be written. Note that this code is essentially the same code wherever you are
writing it. Whatever work you are doing in Avro to be able to dynamically interpret JSON IDL
could just be directly implemented in Thrift -- we would just define a JSON version of the
Thrift IDL which would look a lot like Avro's IDL. To help further with interoperability we
could make the Thrift compiler generate the JSON IDL from the Thrift IDL as another output
target.

The basic upshot of the above is that it is not that hard to see how Avro could be directly
integrated into Thrift if you were willing to entertain that option and I believe that you
would see significant benefits that would more than offset the impact to your own ease of
development about which you expressed concerns.

To touch on a couple of specific responses from your previous email to me:

>> If the field IDs are really so
>> objectionable, Thrift could allow them to be optional for purely
>> dynamic usages.
>
> Optional features increase compatibility complexity and are harder
> to maintain and test. A Thrift IDL without numbers would not provide
> versioning features to non-dynamic languages.

Let me rephrase my suggestion because I think I may not have put it across as clearly as I
could have. I am proposing that the IDL would only allow for field IDs to be omitted in the
case where the schema was being interpreted dynamically -- no static bindings could be generated
from IDL without fully specified field IDs. So if you are only interested in dynamic interpretation,
you never have to look at or even think about field IDs. Does that in any way alter your stance
here?
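
A rough Python sketch of the rule I have in mind follows. The JSON layout, the `id` key, and both function names are hypothetical, invented only for this illustration; neither Thrift nor Avro defines them. The dynamic path accepts a schema whether or not field IDs are present, while the static-bindings path refuses any schema that is missing one.

```python
import json

# Hypothetical JSON IDL: "host" omits its field ID, "ts" has one.
IDL_JSON = """
{"name": "LogEntry",
 "fields": [{"name": "host", "type": "string"},
            {"name": "ts", "type": "i64", "id": 2}]}
"""


def load_dynamic(idl_text):
    """Dynamic interpretation: fields are looked up by name, IDs are ignored."""
    schema = json.loads(idl_text)
    return {f["name"]: f["type"] for f in schema["fields"]}


def fields_for_static_bindings(idl_text):
    """Static code generation: every field must carry an explicit ID."""
    schema = json.loads(idl_text)
    missing = [f["name"] for f in schema["fields"] if "id" not in f]
    if missing:
        raise ValueError("fields without IDs: " + ", ".join(missing))
    return {f["id"]: f["name"] for f in schema["fields"]}
```

With the IDL above, `load_dynamic(IDL_JSON)` succeeds and `fields_for_static_bindings(IDL_JSON)` raises `ValueError` naming `host`, so someone who only uses dynamic interpretation never has to write a field ID.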

> It could be a floor wax and a dessert topping!

Love the SNL reference, but I don't think it is really apropos. My vision of Thrift with
Avro's features folded in, as a unified framework for cross-language serialization covering
a variety of use cases, is not jamming two completely heterogeneous things together. I can
easily see wanting to take structures represented in one serialization format from disk and
send them out over RPC. Thrift provides the means to do this kind of thing seamlessly, with
formats appropriate to both use cases, rather than selecting a format that is good for one
use case and so-so for the other.

Chad



----- Original Message ----
From: Doug Cutting <cutting@apache.org>
To: general@hadoop.apache.org
Sent: Monday, April 6, 2009 9:15:01 PM
Subject: Re: [PROPOSAL] new subproject: Avro

Kevin Clark wrote:
> The overhead for those people (or some
> equivalent group) to pay attention to another mailing list, another
> bug tracker, another irc channel, and another community isn't trivial.

Communities form around code, and, if Avro's code is largely disjoint from Thrift's, we should
not assume that everyone in the Thrift community cares about Avro or vice versa.

> Of course, this assumes that one of the primary goals of Avro is to be
> cross language. Is that the case, or have I misunderstood?

Yes, that is a goal.

> It would be perfectly reasonable for Hadoop to specify that they
> use the Avro data format for transmissions, and the cross language
> library to provide the API could be Thrift. I think you said something
> similar in your post, but if not please do clarify.

Yes, perhaps this could be done.  I am not convinced that TProtocol is an ideal API for reading
and writing Avro data, but it could perhaps be made to work reasonably well.

> That being said, I'm fairly confident we'll be providing an Avro
> protocol on our own at some point if you're not interested in working
> together. But I think if we go down that path we're doing a disservice
> to users of both Thrift and Avro.

I have never said I was not interested in working together.  I've said that I think Avro is
fundamentally different from Thrift.  Avro is a specific format, Thrift is a generic API for
various formats, none like Avro.  They might be made to work together.  But at this point
I see no point in forcing them together.  If TProtocol's API is a good match for Avro's format
and features, then it should be easy for folks to implement TProtocol using Avro's code and
include Avro in Thrift.  If the match is not good then perhaps we can adjust Thrift and/or
Avro to improve it.

Doug

