hadoop-general mailing list archives

From Eric Sammer <esam...@cloudera.com>
Subject Re: [VOTE] Direction for Hadoop development
Date Tue, 07 Dec 2010 18:08:29 GMT
Thanks for your response, Owen. I'll comment on the JIRA with my
opinion. I didn't want to muddy the existing conversation, but if it
helps to have user-level input, I'm happy to throw my hat in the ring.

Just the summary version this time:

Non-technical:

- I believe we need to balance our goal of stability against the need
for growth and improvement. The project should be free to innovate; we
all agree on this. The question, to me, is how we do that, and we
should take a (brief) step back to decide.
- We should reevaluate how most people view and use Hadoop to
help us make these decisions. For instance, do people see Hadoop as a
turnkey system that includes everything required, or do they view it
as a framework for building custom data systems? What I've seen and
believe is that it's more the latter, and that some "after market"
customization is normal. The community / ASF spinning off projects
like Pig, Hive, ZK, Chukwa, and others reinforces this in my mind;
these are not bits of Hadoop proper, but natural extensions with their
own development paths and release schedules.
- No one benefits from Hadoop being difficult to use,
including those of us at Cloudera [1]. I don't want anyone to see us as
wanting to create complexity. We all benefit from a healthy Hadoop
community.

Technical:

- Any modification to SequenceFile (and friends) worries me, as so much
is tightly bound to it. I think this is an artifact of people coding to
the implementation rather than the interface, so to speak (see the
sketch after this list).
- Generally, I agree with a lot of Owen's motivation (e.g. codifying
the serialization system, using multiple libs to prove the abstraction
is sound), but some of the implementation could be more forgiving of
usage patterns in the wild (e.g. the conflicting dep version issues,
and whether future dev on some of these file formats should be
extracted from Hadoop proper).

Proposal:

- Codify (by vote) whether design plans are required or whether an
informal email indicating intent is sufficient, and under what
circumstances. Provide examples to clarify those circumstances. This
solves the long-term problem, but not HADOOP-6685.
- Focus the discussion on evaluating proposals to improve our
conflict-resolution process. I know some mechanisms exist, but they're
drastic (removal of PMC members, for instance).
- After consensus on above, focus the conversation (in another thread
or on JIRA, whatever is most appropriate) on HADOOP-6685 so no one is
blocked.
- Put the community of users first in all areas of development and interaction.

[1] I am officially speaking out of school. I am not an official
spokesperson for Cloudera. This is my opinion and I happen to work at
Cloudera.

On Tue, Dec 7, 2010 at 3:13 AM, Owen O'Malley <omalley@apache.org> wrote:
>
> On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:
>
> Eric,
>   Since this is mostly technical, it probably should be on the h-6685 jira
> instead of general@hadoop.
>
>> I think Hadoop should support interfaces to control
>> serialization plugin lifecycle and element serialization to / from an
>> abstract notion of a datum and bytes only.
>
> The core of my h-6685 patch updates the API to replace the typename with a
> serialization name and serialization-specific metadata. That metadata is a
> set of bytes that are defined by the serialization. The typename alone is
> insufficient for Avro, and having additional metadata will be useful for the
> other serializations as well.
>
> Doug suggested that I add a user-friendly pair of methods and I did. While
> they are redundant, the set of serializations isn't expected to be large and
> therefore the extra code isn't much.
>
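
(As a sanity check on my own reading: the shape I'd expect from that
description is roughly the sketch below. The names are my guesses for
discussion, not the actual h-6685 patch.)

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // A guess at the rough shape implied above, for discussion only.
    public abstract class SerializationSketch<T> {
      // Replaces the bare typename: identifies which serialization
      // wrote the data, e.g. "writable" or "avro".
      public abstract String getName();

      // Opaque, serialization-defined bytes; a typename alone isn't
      // enough for Avro, so a schema (or similar) could travel here.
      public abstract byte[] getMetadata();

      public abstract void serialize(T t, OutputStream out) throws IOException;
      public abstract T deserialize(InputStream in) throws IOException;
    }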
>> I would like to not mention
>> a serialization implementation by name in Hadoop proper, at all.
>
> My patch removes some of the lingering references to Writables in
> SequenceFile, MapFile, etc. and moves them over to the generic serialization
> API. The framework will likely continue to depend on whichever serialization
> is used for RPC. Currently that is Writables, but it will likely transition
> to either Avro or ProtoBuf in the coming year.
>
>> A
>> single implementation to serve as a reference implementation makes
>> sense.
>
> A critical part of Hadoop's usability comes from its framework combined with
> library code that allows users to get the desired functionality without
> writing it themselves. Sure, it is easy to write a hash table yourself, but
> it is far easier to use the one bundled with Java.
>
>> The default
>> classpath should remain as free of mandatory external dependencies as
>> possible; library conflicts are still an extremely sore spot in Java
>> development at many sites I visit and forcing a large commercial
> entity to use version X of something like Avro, Thrift, or PB is almost a
>> non-starter for many.
>
> I discussed this problem in the jira, but either the MapReduce user is using
> the X library or they don't care about the version of X. If they are using
> it, it is far more convenient to have the serialization on the classpath.
> There is a missing feature that we need to address to put the user's files
> ahead of the system ones. I'll file a jira for that.
>
> It might also make sense for us to shade some of our dependencies, but that
> is a much bigger issue and is far from clear cut.
>
>> If a PB / Thrift / Avro serialization implementation is part of
>> contrib or externally managed, it requires the user to understand this
>> dependency exists and manage the classpath.
>
> The goal is to make Hadoop useful out of the box. If we make it so that
> Hadoop is only useful once it is bundled with 15 other projects, that is
> good for people who sell distributions that include Hadoop, but not for the
> project.
>
>> I think we can simplify
>> serialization plugin configuration via a classpath include system by
>> using something like run-parts or similar and the current
>> configuration system, but that's another issue.
>
> The current patch loads the serialization plugins based on the
> configuration. If you don't want to support Thrift, don't configure it. The
> same holds true of the other serializations, even Writable.
>
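
(For what it's worth: if the patch keeps the existing io.serializations
key, turning a plugin off is just a matter of leaving it off the list.
I'm assuming that key and the stock Writable serializer class are
unchanged here.)

    import org.apache.hadoop.conf.Configuration;

    public class SerializationConfigExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // List only the serializations you want loaded; anything
        // omitted (Thrift, Avro, ...) is simply never instantiated.
        conf.set("io.serializations",
            "org.apache.hadoop.io.serializer.WritableSerialization");
        System.out.println(conf.get("io.serializations"));
      }
    }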
>> I'm a bit confused as to how this equates with sequence files being
>> deprecated or arrested.
>
> Doug vetoed my patch partially based on his assertion that SequenceFiles
> should be deprecated and that Hadoop should just be the framework with no
> library code.
>
>>  Whether we choose to focus development
>> elsewhere ("soft deprecate") or actively encourage users elsewhere
>> ("@Deprecated") is an issue I think we can sever from this discussion.
>
> At this point the PMC has supported continuing to invest in developing
> SequenceFiles.
>
>> - Don't break existing SequenceFiles.
>
> That goes without saying; everyone has petabytes of data in them.
>
>> - Common, HDFS, MR should contain as few mandatory external deps as
>> humanly possible because Java classloader semantics and a lack of
>> internal dep isolation is just kookoo for cocoa puffs. (Simplify it
>> and bring on our OSGI overlords.)
>
> That is a much bigger discussion that we should probably have. There are
> costs on both sides in terms of debugging and understandability. In
> particular, in most cases we are much better off using a library that has
> the right functionality than re-implementing it ourselves.
>
>> - We (non-committers / users / casual contributors) want only for
>> Hadoop to mature in features and stability, be an inviting community
>> to new potential contributors and users, and to be around for a long
>> time.
>
> I want that too.
>
> -- Owen



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com
