hadoop-general mailing list archives

From Eli Collins <...@cloudera.com>
Subject Re: bringing the codebases back in line
Date Fri, 22 Oct 2010 00:37:40 GMT
On Thu, Oct 21, 2010 at 4:50 PM, Ian Holsman <hadoop@holsman.net> wrote:
> right.. Cloudera is bundling its add-ons into a single tarball to make it
> easier to install.

CDH contains a number of different projects; however, each project has
a distinct tarball (and packages). The tarball is essentially an
Apache release (tarball) plus a directory that has a set of patches
that we've applied to the Apache release (our build process downloads
the Apache release and applies our set of patches).  For each version
of CDH we rebase our patch set on the latest Apache dot release
available at the time to minimize our delta with upstream.  Here's an
example tarball:

> In my ideal world, I'd like to be able to just download/buy any of those
> tools and have them run on a released apache hadoop tarball. and then if
> someone else comes along with a competing tool I would be free to choose it
> and have it also run on my apache hadoop tarball, not have to go through the
> pain of saying XXX tool needs their customized version of hadoop so I can't
> use it. (ie remove the lock-in that comes from a forked base).

All of our Apache projects are an Apache release plus a set of
patches; these are typically backports of bug fixes that are in trunk
but not yet in a dot release. Except for Hadoop, the set of additional
patches is very
small. Here's an example, the 16 changes not in Pig 0.7 that we've
included: http://archive.cloudera.com/cdh/3/pig-0.7.0+16.CHANGES.txt
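
To make the naming concrete, here is a minimal sketch (not part of our
build tooling) that splits a file name like the one above into the
upstream base release and the patch count. It assumes the "+16" suffix
is the number of patches applied on top of the Apache release, which
matches the 16 changes mentioned above; the names CdhComponent and
parse_component are made up for illustration.

```python
import re
from typing import NamedTuple


class CdhComponent(NamedTuple):
    """Hypothetical record for a name such as 'pig-0.7.0+16'."""
    name: str          # project name, e.g. "pig"
    base_version: str  # upstream Apache release, e.g. "0.7.0"
    patch_count: int   # assumed: number of patches on top of the release


# <name>-<dotted base version>+<patch count>, followed by anything
# (for example ".CHANGES.txt").
_PATTERN = re.compile(
    r"^(?P<name>[a-z][a-z0-9-]*?)-(?P<base>\d+(?:\.\d+)*)\+(?P<patches>\d+)"
)


def parse_component(filename: str) -> CdhComponent:
    """Split a '<name>-<base>+<patches>' file name into its parts."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not in <name>-<base>+<patches> form: {filename!r}")
    return CdhComponent(match["name"], match["base"], int(match["patches"]))


if __name__ == "__main__":
    print(parse_component("pig-0.7.0+16.CHANGES.txt"))
```

Under that assumption, a small patch count relative to the base release
is a quick way to see how close a component is to upstream.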

> so what I'd like to see is both cloudera and yahoo running a minimal set of
> patches as a 'superset' of the apache hadoop stuff, with the apache hadoop
> very close to both of these. the only patches being in either being to fix
> bugs or performance issues that would be available in the next release of
> a-hadoop.

That's our goal as well. For all the Apache projects in CDH, except
for Hadoop, that is the case today.  For CDH3 we ended up adding large
additional patch sets (the security patch set and the append patch set
to support HBase), but for Apache 0.22 the majority of the delta that
CDH and YDH have against Apache 0.20 will go away (thanks to Y!
contributing security and append to trunk).

> And when a new release of a-hadoop comes, the vendors would switch to
> using that a-hadoop version as their baseline.
> I don't want to get into the situation that linux is in with redhat in that
> their kernel is dramatically different to the one on kernel.org.
> does that make sense?


> On Thu, Oct 21, 2010 at 6:42 PM, Owen O'Malley <omalley@apache.org> wrote:
>> On Oct 21, 2010, at 3:19 PM, Doug Cutting wrote:
>>  Cloudera's distribution is based on Y!'s 0.20 distribution, together with
>>> patches from the Apache 0.20-append branch,
>> Cloudera's Distribution of Hadoop includes many tools from outside of
>> Hadoop and even outside of Apache.
>> -- Owen
