hadoop-general mailing list archives

From Chad Walters <chad_walt...@yahoo.com>
Subject Re: [PROPOSAL] new subproject: Avro
Date Wed, 08 Apr 2009 07:03:33 GMT


After our off-list chat, and given that you have indicated that the design is still in flux
and that you are open to discussing changes that would permit interoperability, I am not as
concerned as I was.

My urgency came from concern that once the design was put in place as part of an Apache subproject,
rather than open sourced in some other less prominent forum, it would increase the barrier
to interoperability; in particular, I was concerned that people would assume the design of
the data format was fully-baked and start persisting large amounts of data in some early version
of the format, potentially prematurely ossifying the design in a state unsuited for compatibility
with Thrift. Given your clarifications around this, my fears clearly were not well-founded.

Please accept my apology if I came across as obstructionist. I was honestly advocating on
behalf of what I believe is in the best interest of our shared user base.

Clearly we have some disagreements about the value of some of Thrift's design choices and
what those mean for various use cases. I think we also have some differences of opinion about
the relative difficulty of implementation versus the value of interoperability. Hopefully,
the next few months will afford an opportunity to examine the sources of those disagreements
and see if they can be resolved.



----- Original Message ----
From: Doug Cutting <cutting@apache.org>
To: general@hadoop.apache.org
Sent: Tuesday, April 7, 2009 8:33:32 PM
Subject: Re: [PROPOSAL] new subproject: Avro

To be clear, since a few folks have missed this point: Avro is not complete.  At some point
in the future, before people start using it as a format for persistent data, we'll need to
stop altering its specification, or at least do so much more cautiously.  But before then,
my immediate goal is to move development from private to open so that we have a chance to incorporate
feedback before we lock down the specification.

For example, several folks have raised the issue of compatibility with Thrift.  We certainly
want to avoid gratuitous incompatibilities.  There are also features clearly missing from
Avro that we expect to add before we make a release, like default values, a more efficient
RPC handshake, etc.  And some features that we might consider removing, if they're not broadly
useful and inhibit interoperability, like single-float, which isn't in Thrift, Python, etc.
And I expect there will be more such issues raised in the coming weeks and months.

But before we can discuss and resolve such issues we need a forum in which to do so.  That's
all I am after at this point: mailing lists, a bug database, a public source code repository,
etc., so that we can start accepting patches, adding committers, etc.

Three days have now passed since I initially proposed this, the nominal time for an Apache
vote.  Is there anyone who strongly opposes taking the development of Avro public as a Hadoop
subproject?  Only PMC votes are binding, but I would vastly prefer that the broader community
also supports this step in the process.



Doug Cutting wrote:
> I propose we add a new Hadoop subproject for Avro, a serialization system.  My ambition
> is for Avro to replace both Hadoop's RPC and to be used for most Hadoop data files, e.g.,
> by Pig, Hive, etc.
> Initial committers would be Sharad Agarwal and me, both existing Hadoop committers.
> We are the sole authors of this software to date.
> The code is currently at:
> http://people.apache.org/~cutting/avro.git/
> To learn more:
> git clone http://people.apache.org/~cutting/avro.git/ avro
> cat avro/README.txt
> Comments?  Questions?
> Doug
