From: Chad Walters <Chad.Walters@microsoft.com>
To: general@hadoop.apache.org
Date: Sun, 5 Apr 2009 23:51:27 -0700
Subject: RE: [PROPOSAL] new subproject: Avro

Doug,

First, let me say that I think Avro has a lot of useful features -- features that I would like to see fully supported in Thrift. At a minimum, I would like for us to be able to hash out the details to guarantee that there can really be full interoperability between Avro and Thrift. I am really interested in working cooperatively and collaboratively on this, and I am willing to put in significant time on design and communication to help make full interoperability possible (I am unfortunately not able to contribute code directly at this time).

Second, I think the decision about where Avro should live requires more thought and more discussion. I'd love to hear from more folks outside of Yahoo on this topic: so far all of the +1 votes have come from Yahoo employees. I'd also love to hear from other folks who have significant investments in both Thrift and Hadoop.

Some points to think about:

-- You suggest that there is not a lot in Thrift that Avro can leverage. I think you may be overlooking the fact that Thrift has a user base and a community of developers who are very interested in issues of cross-language data serialization and interoperability. Thrift has committers with expertise in a pretty big set of languages, and leveraging this could get Avro's functionality onto more languages faster than the current path. Also, there is in fact significant overlap between Hadoop users and Thrift users at this point, as well as significant use of Thrift in more than one Hadoop sub-project. At the code level, Thrift contains a transport abstraction and multiple transport and server implementations in many different target languages. If there were closer collaboration, Avro could certainly benefit from leveraging the existing ones, and any additional contributions in this area would benefit both projects.
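To make that transport/protocol layering concrete, here is a rough Python sketch of the standard Thrift client pattern; the UserStore service and its generated userstore module are hypothetical stand-ins, and module details may vary by Thrift release:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol

    # Hypothetical generated bindings for a service defined in Thrift IDL.
    from userstore import UserStore

    # Layer a buffered transport over a raw socket, then choose a wire format;
    # the protocol and transport layers vary independently.
    transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = UserStore.Client(protocol)
    transport.open()
    # Swapping TBinaryProtocol for, say, a compact or JSON protocol leaves
    # this client code untouched -- that is the abstraction Avro could reuse.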
-- You also suggest that the two are largely disjoint from a technical perspective: "Thrift fundamentally standardizes an API, not a data format. Avro fundamentally is a data format specification, like XML." I agree with the fundamental part, but I think that doesn't bring to light enough of what is in common and what is different for purposes of this discussion. Thrift specifies a type system, an API for data formats and transport mechanisms, and a schema resolution algorithm, and it provides implementations of several distinct data formats and transports. Avro specifies a single data format, but it also brings along several other things as well, including a type system, a specific RPC mechanism, and a schema resolution algorithm.

The most significant issue is that both of them specify a type system. At a very minimum I would like to see Avro and Thrift make agreements on that type system. The fact that there is significant existing investment in the Thrift type system by the Thrift community should weigh somewhere in this discussion. Obviously, the technical needs of Avro will also have weight there, but where there is room for choice, the Thrift choices should be respected. Arbitrary changes here will make it unnecessarily painful, perhaps impossible, for Thrift to directly adopt Avro; instead Thrift will be forced to make an "Avro-like" data specification, hampering interoperability for everyone. There may be pitfalls in the other areas of overlap as well that would prevent real interoperability -- let's elucidate them in further discussions.

-- Avro appears to have 3 primary features that Thrift does not currently support sufficiently:

1. Schema serialization, allowing a compact representation of files containing large numbers of records of identical type.
2. Dynamic interpretation of schemas, which improves ease of use in dynamic languages (like the Python Hadoop Streaming use case).
3. Lazy partial deserialization to support "projection".

Note that features 1 and 3 are independent of whether schemas are dynamically interpreted or compiled into static bindings.

WRT #1: Thrift's DenseProtocol goes some distance towards this, although it doesn't go the whole way. Thrift can easily be extended to further compact the DenseProtocol's wire format for special cases where all fields are required. We have previously had significant discussions on the Thrift list about doing more in this area, but we couldn't get the folks from Hadoop who cared most about this use case to participate with us on capturing a complete set of requirements, so there was no strong driver for it.

WRT #2: I totally understand the case you make for dynamic interpretation in ad hoc data processing. I would love to see Thrift enhanced to do this kind of thing.

WRT #3: Partial deserialization seems like a really useful feature for several use cases, not just for "projection". I think Thrift could and should be extended to support this functionality, and it should be available for both static bindings and dynamic schema interpretation, via field names and field IDs where possible.
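As a concrete illustration of feature #2 (and, via a reader's schema, #3), here is a minimal Python sketch against the Avro reference implementation as I understand it; events.avro is a hypothetical data file:

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # The writer's schema is embedded in the container file, so no generated
    # bindings are needed; each record comes back as a plain Python dict.
    reader = DataFileReader(open("events.avro", "rb"), DatumReader())
    for record in reader:
        print(record)
    reader.close()

If I read the schema-resolution rules correctly, handing DatumReader a reader's schema that names only a subset of the fields is how projection (#3) would be expressed in this same API.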
-- You state: "Perhaps Thrift could be augmented to support Avro's JSON schemas and serialization. Then it could interoperate with other Avro-based systems. But then Thrift would have yet another serialization format, that every language would need to implement for it to be useful..."

First, that "Perhaps" hides a lot of complexity, and unless that is hashed out ahead of time I am pretty sure the real answer will be "Thrift cannot be augmented to support Avro directly, but could instead be augmented to support something that looks quite a bit like Avro but differs in mostly unimportant ways." To me that seems like a shame.

Furthermore, you say that last part ("Thrift would have yet another serialization format...") like it is a bad thing... Note that it is an explicit design goal of Thrift to allow for multiple different serialization formats so that lots of different use cases can be supported by the same fundamental framework. This is a clear recognition that there is no one-size-fits-all answer for data serialization (fast RPC vs. compact archival record data vs. human readability, to name a few salient use cases). For a compelling enough use case, there is no reason not to port new protocols across multiple languages (generally done on an as-needed basis by someone who wants that functionality in that language). Another great feature of the protocol abstraction is that it allows data to be seamlessly moved from one serialization format to another as, say, it is read out of archival storage and sent on as RPC.

Also, doesn't Avro essentially contain "another serialization format that every language would need to implement for it to be useful"? Seems like the same basic set of work to me, whether it is in Avro or Thrift.

-- You state: "Avro fundamentally is a data format specification, like XML. Thrift could implement this specification. The Avro project includes reference implementations, but the format is intended to be simple enough and the specification stable enough that others might reasonably develop alternate, independent implementations."

I think this is a bit inaccurate. First there is the issue of type system compatibility that I raised above, and the implausibility of satisfying that "could" without refinement of, and collaboration on, Avro's specification. Furthermore, the stated goal of the subproject is "for Avro to replace both Hadoop's RPC and to be used for most Hadoop data files". This will bring in quite a bit beyond a reference implementation of a data format specification, especially depending on how many languages you intend to build RPC support for (Java, Python, and C++ were all mentioned at some point -- others?). I don't think it is unreasonable that the significant proportion of folks in the Hadoop community who are also using Thrift are puzzled about why there isn't more consideration being given to convergence between Avro and Thrift.

-- You state: "Also, with the schema, resolving version differences is simplified. Developers don't need to assign field numbers, but can just use names. For performance, one can internally use field numbers while reading, to avoid string comparisons, but developers need no longer specify these, but can use names, as in most software. Here having the schema means we can simplify the IDL and its versioning semantics."

So the simplification comes simply from not having the field IDs in the IDL? I am not sure why having sequential ID numbers after each field is considered to be so onerous. I honestly have never heard a single Thrift user complain about this. Anyone doing more than just that is doing something advanced that wouldn't be possible without the field IDs (like renaming a field).
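To make the names-vs-numbers comparison concrete, here is a toy Python sketch (mine, not from either library) of what the two matching strategies come down to when a record written under an old schema meets a new one:

    # Name-based matching (the Avro approach): fields line up by name, and
    # fields the writer didn't know about are filled in from defaults.
    old_record = {"name": "chad", "email": "chad@example.com"}
    new_fields = ["name", "email", "phone"]   # reader's schema adds "phone"
    defaults = {"phone": None}

    resolved = {f: old_record.get(f, defaults.get(f)) for f in new_fields}
    print(resolved)  # {'name': 'chad', 'email': 'chad@example.com', 'phone': None}

    # ID-based matching (the Thrift approach) is the same operation keyed on
    # numeric tags, e.g. {1: "chad", 2: "chad@example.com"} -- which is exactly
    # what makes renaming a field safe.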
I think having to deal with JSON syntax in the Avro IDL is actually more annoying for humans than the application of field IDs, both for the added syntactic punctuation and for the increased verbosity. If the field IDs are really so objectionable, Thrift could allow them to be optional for purely dynamic usages. I also don't see why matching names is considered easier than matching numbers, which is essentially what the versioning semantics come down to in the end. Am I missing something here?

-- You state: "Would you write parsers for Thrift's IDL in every language? Or would you use JSON, as Avro does, to avoid that?"

Here I totally agree with you: a JSON IDL is better for machine parsing than Thrift's current IDL, which is targeted more at human parsing. And given that I agree that some form of dynamic interpretation is a useful feature, I don't see any reason why a JSON version of the IDL couldn't become part of the picture. Furthermore, the Thrift IDL compiler could easily be extended to take this JSON format as both an input (in addition to the current Thrift IDL) and an output. An alternative would be to have the other languages bind to the Thrift IDL parser directly -- most languages bind to C (granted, for some it is easier than for others) -- and get back the parsed data structure to interpret.

-- By making Avro a sub-project of Hadoop, I believe you will succeed in producing an improved version of Hadoop Record IO and a better RPC mechanism than the current Hadoop RPC. However, I don't think that this will result in a better general RPC than Thrift, and it will certainly be much less performant for RPC in a wide range of applications.

Consider an alternative: making Avro more like a sub-project of Thrift, or just implementing it directly in Thrift. In that case, I think the end result will be a powerful and flexible "one-stop shop" for data serialization for RPC and archival purposes, with the ability to bring both static and dynamic capabilities as needed for particular applications. To me this seems like a bigger win for both Hadoop and for Thrift.

Thanks for reading through to this point. I look forward to further discussion.

Chad