hadoop-general mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: [VOTE] Direction for Hadoop development
Date Tue, 07 Dec 2010 03:36:58 GMT
I'm going to rather purposefully ignore larger questions like how the
ASF works or doesn't, veto usage, etc. I'm not well versed enough in
the Apache way to weigh in.

As someone who sees a lot of Hadoop clusters at many different
companies, I would like to see Hadoop's serialization system(s)
change. I think Hadoop should support interfaces to control
serialization plugin lifecycle and element serialization to / from an
abstract notion of a datum and bytes only. I would like to not mention
a serialization implementation by name in Hadoop proper, at all. A
single implementation to serve as a reference implementation makes
sense. To preserve backward compatibility and existing investment, it
makes sense for that to be Writable (whether we like it or not).
Additional implementations should be either "contrib" status (if
that's still an option) or externally managed (probably preferred due
to release cycle synchronization / update issues). The default
classpath should remain as free of mandatory external dependencies as
possible; library conflicts are still an extremely sore spot in Java
development at many sites I visit, and forcing a large commercial
entity to use version X of something like Avro, Thrift, or PB is
almost a non-starter for many.
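To make the shape of that concrete, here is a minimal sketch of the kind of interface I mean: lifecycle hooks plus datum-to-bytes and back, with no serialization library named anywhere. Every name below is a hypothetical illustration, not a proposal for actual signatures.

```java
// Hypothetical plugin contract: lifecycle plus datum <-> bytes, nothing more.
interface SerializationPlugin<T> {
    void initialize();              // lifecycle: called once when the plugin loads
    byte[] serialize(T datum);      // abstract datum -> bytes
    T deserialize(byte[] bytes);    // bytes -> datum
    void close();                   // lifecycle: called at teardown
}

// A trivial reference implementation for strings, standing in for the role
// Writable would play; note that no external library appears anywhere.
class StringPlugin implements SerializationPlugin<String> {
    public void initialize() { /* no resources to acquire */ }
    public byte[] serialize(String datum) {
        return datum.getBytes(java.nio.charset.StandardCharsets.UTF_8);
    }
    public String deserialize(byte[] bytes) {
        return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);
    }
    public void close() { /* nothing to release */ }
}

public class PluginDemo {
    public static void main(String[] args) {
        SerializationPlugin<String> p = new StringPlugin();
        p.initialize();
        String back = p.deserialize(p.serialize("hello"));
        p.close();
        System.out.println(back);   // prints "hello"
    }
}
```

A PB / Thrift / Avro plugin would live behind the same contract, and Hadoop proper would only ever see the interface.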

If a PB / Thrift / Avro serialization implementation is part of
contrib or externally managed, it requires the user to understand this
dependency exists and manage the classpath. The precedent in my mind
is the scheduler situation; most folks run with either the cap or fair
schedulers but FIFO provides a default. If you opt to use one and it
comes with dependencies, that's your business. I think we can simplify
serialization plugin configuration via a classpath include system by
using something like run-parts or similar and the current
configuration system, but that's another issue.
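For illustration, the existing io.serializations key already points in this direction: a site that opts in to, say, Avro lists the plugin explicitly in its configuration, and the dependency is then knowingly theirs to manage. The Avro class name below is illustrative of the pattern, not a claim about what ships where.

```xml
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,
         org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization</value>
</property>
```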

In the absence of an "opt in" serialization configuration pattern, we
must at least provide an "opt out." If a user uses Thrift for their
own MR jobs internally, we shouldn't throw a monkey wrench into their
life by demanding it for core Hadoop. Provide them a means to
de-configure built-in serialization impls and remove Thrift from the
classpath.
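Concretely, the "opt out" could be as simple as trimming the serialization list back to the reference implementation in the site configuration (a sketch, assuming the current io.serializations key keeps its role):

```xml
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization</value>
</property>
```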

I'm a bit confused as to how this equates with sequence files being
deprecated or arrested. I tried to read HADOOP-6685 but there's a lot
of internal references and context I feel like I'm missing. Suffice it
to say, sequence files can *not* be broken for existing data for the
reasons everyone has stated. Whether we choose to focus development
elsewhere ("soft deprecate") or actively encourage users elsewhere
("@Deprecated") is an issue I think we can sever from this discussion.

tl;dr version:

- Don't break existing SequenceFiles.
- Serialization should be a richer interface that supports plugin
lifecycle and serialize / deserialize only, retrofitted using PB,
Avro, and Thrift as immediate consumer use cases. Serialization APIs
should be promoted to an officially public, documented API suited to
modern serialization lib requirements.
- Common, HDFS, and MR should contain as few mandatory external deps
as humanly possible, because Java classloader semantics and the lack
of internal dep isolation are just kookoo for cocoa puffs. (Simplify
it and bring on our OSGi overlords.)
- We (non-committers / users / casual contributors) want only for
Hadoop to mature in features and stability, be an inviting community
to new potential contributors and users, and to be around for a long
time.
Regards, respect, and thanks to all.

On Mon, Nov 29, 2010 at 5:30 PM, Owen O'Malley <oom@yahoo-inc.com> wrote:
> All,
>   Based on the discussion on HADOOP-6685, there is a pretty fundamental
> difference of opinion about how Hadoop should evolve. We need to figure out
> how the majority of the PMC wants the project to evolve to understand which
> patches move us forward. Please vote whether you approve of the following
> direction. Clearly as the author, I'm +1.
> -- Owen
> Hadoop has always included library code so that users had a strong
> foundation to build their applications on without needing to continually
> reinvent the wheel. This combination of framework and powerful library code
> is a common pattern for successful projects, such as Java, Lucene, etc.
> Toward that end, we need to continue to extend the Hadoop library code and
> actively maintain it as the framework evolves. Continuing support for
> SequenceFile and TFile, which are both widely used, is mandatory. The
> opposite pattern of implementing the framework and letting each distribution
> add the required libraries will lead to increased community fragmentation
> and vendor lock in.
> Hadoop's generic serialization framework had a lot of promise when it was
> introduced, but has been hampered by a lack of plugins other than Writables
> and Java serialization. Supporting a wide range of serializations natively
> in Hadoop will give the users new capabilities. Currently, to support Avro
> or ProtoBuf objects, mutually incompatible third-party solutions are
> required. It benefits Hadoop to support them with a common framework that
> will support all of them. In particular, having easy, out of the box support
> for Thrift, ProtoBufs, Avro, and our legacy serializations is a desired
> state.
> As a distributed system, there are many instances where Hadoop needs to
> serialize data. Many of those applications need a lightweight, versioned
> serialization framework like ProtocolBuffers or Thrift and using them is
> appropriate. Adding dependences on Thrift and ProtocolBuffers to the
> previous dependence on Avro is acceptable.

Eric Sammer
twitter: esammer
data: www.cloudera.com
