hadoop-general mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: [VOTE] Direction for Hadoop development
Date Tue, 07 Dec 2010 08:13:58 GMT

On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:

    Since this is mostly technical, it probably should be on the  
h-6685 jira instead of general@hadoop.

> I think Hadoop should support interfaces to control
> serialization plugin lifecycle and element serialization to / from an
> abstract notion of a datum and bytes only.

The core of my h-6685 patch updates the API to replace the typename
with a serialization name and serialization-specific metadata. That
metadata is a set of bytes that are defined by the serialization. The  
typename alone is insufficient for Avro and having additional metadata  
will be useful for the other serializations as well.
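To make the shape of that change concrete, here is a minimal sketch of
the idea: keying a serialization by its own name plus an opaque
metadata blob that only that serialization interprets, rather than by
a Java class name. The names below (SerializationInfo, avroExample)
are illustrative only, not taken from the actual patch:

```java
import java.nio.charset.StandardCharsets;

// Sketch: a serialization is identified by its name plus a
// serialization-defined byte[] of metadata, instead of a typename.
public class SerializationMetadataDemo {

    // Metadata record: serialization name + serialization-defined bytes.
    static final class SerializationInfo {
        final String serializationName;
        final byte[] metadata;

        SerializationInfo(String serializationName, byte[] metadata) {
            this.serializationName = serializationName;
            this.metadata = metadata;
        }
    }

    // An Avro-style serialization can carry its schema in the metadata,
    // which a bare typename could never express.
    static SerializationInfo avroExample() {
        String schema = "{\"type\":\"record\",\"name\":\"Point\","
                + "\"fields\":[{\"name\":\"x\",\"type\":\"int\"}]}";
        return new SerializationInfo("avro",
                schema.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        SerializationInfo info = avroExample();
        System.out.println(info.serializationName);
        System.out.println(
                new String(info.metadata, StandardCharsets.UTF_8));
    }
}
```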

Doug suggested that I add a user-friendly pair of methods and I did.  
While they are redundant, the set of serializations isn't expected to  
be large and therefore the extra code isn't much.

> I would like to not mention
> a serialization implementation by name in Hadoop proper, at all.

My patch removes some of the lingering references to Writables in
SequenceFile, MapFile, etc. and moves them over to the generic
serialization API. The framework will likely continue to depend on
whichever serialization is used for RPC. Currently that is Writables,
but it will likely transition to either Avro or ProtoBuf in the
future.

> A
> single implementation to serve as a reference implementation makes
> sense.

A critical part of Hadoop's usability comes from its framework  
combined with library code that allows users to get the desired  
functionality without writing it themselves. Sure, it is easy to write  
a hash table yourself, but it is far easier to use the one bundled  
with Java.

> The default
> classpath should remain as free of mandatory external dependencies as
> possible; library conflicts are still an extremely sore spot in Java
> development at many sites I visit and forcing a large commercial
> entity to use version X of something like Avro, Thrift, PB is almost a
> non-starter for many.

I discussed this problem in the jira, but either the MapReduce user is
using the X library or doesn't care about the version of X. If they are
using it, it is far more convenient to have the serialization on the  
classpath. There is a missing feature that we need to address to put  
the user's files ahead of the system ones. I'll file a jira for that.
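For illustration only, such a feature would amount to a job-level
switch along these lines (this is a sketch of the missing feature
being described, not something in the patch; the property name shown
is assumed):

```xml
<property>
  <!-- Sketch: ask the task classloader to consult the user's jars
       before the framework's own copies of the same library. -->
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```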

It might also make sense for us to shade some of our dependencies, but  
that is a much bigger issue and is far from clear cut.
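For reference, shading is usually done by rewriting a bundled
dependency into a private package so it cannot collide with the
user's copy. A generic maven-shade-plugin fragment (com.example.x is
a placeholder, and this is not from Hadoop's actual build):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Rewrite the bundled copy of X into a private package so
               it cannot conflict with the user's own version of X. -->
          <relocation>
            <pattern>com.example.x</pattern>
            <shadedPattern>org.apache.hadoop.shaded.com.example.x</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```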

> If a PB / Thrift / Avro serialization implementation is part of
> contrib or externally managed, it requires the user to understand this
> dependency exists and manage the classpath.

The goal is to make Hadoop useful out of the box. If we make it so  
that Hadoop is only useful once it is bundled with 15 other projects,  
that is good for people who sell distributions that include Hadoop,  
but not for the project.

> I think we can simplify
> serialization plugin configuration via a classpath include system by
> using something like run-parts or similar and the current
> configuration system, but that's another issue.

The current patch loads the serialization plugins based on the  
configuration. If you don't want to support thrift, don't configure  
it. The same holds true of the other serializations, even writable.
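Concretely, the active serializations are just a list in the
configuration; leaving one out disables it. A sketch using the
io.serializations key (the Writable class name is the stock Hadoop
one; the Avro entry assumes that serializer is on the classpath):

```xml
<property>
  <!-- Only the serializations listed here are loaded; omit an entry
       (e.g. a Thrift serialization) and it is simply not supported. -->
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,
         org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization</value>
</property>
```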

> I'm a bit confused as to how this equates with sequence files being
> deprecated or arrested.

Doug vetoed my patch partially based on his assertion that  
SequenceFiles should be deprecated and that Hadoop should just be the  
framework with no library code.

>  If we choose to focus development
> elsewhere ("soft deprecate") or actively encourage users elsewhere
> ("@Deprecated") is an issue I think we can sever from this discussion.

At this point the PMC has supported continuing to invest in developing
SequenceFile.

> - Don't break existing SequenceFiles.

That goes without saying; everyone has petabytes of data in them.

> - Common, HDFS, MR should contain as few mandatory external deps as
> humanly possible because Java classloader semantics and a lack of
> internal dep isolation is just kookoo for cocoa puffs. (Simplify it
> and bring on our OSGI overlords.)

That is a much bigger discussion that we should probably have. There  
are costs on both sides in terms of debugging and understandability.  
In particular, in most cases we are much better off using a library
that has the right functionality than re-implementing it ourselves.

> - We (non-committers / users / casual contributors) want only for
> Hadoop to mature in features and stability, be an inviting community
> to new potential contributors and users, and to be around for a long
> time.

I want that too.

-- Owen
