hadoop-general mailing list archives

From Bernd Fondermann <bernd.fonderm...@googlemail.com>
Subject Re: [VOTE] Abandon hdfsproxy HDFS contrib
Date Fri, 18 Feb 2011 13:20:46 GMT
Hi Eric,

On Fri, Feb 18, 2011 at 13:46, Eric Baldeschwieler <eric14@yahoo-inc.com> wrote:
> Hi Bernd,
>
> Apache Hadoop is about scale. Most clusters will always be small, but Hadoop is going
> mainstream precisely because it scales to huge data and cluster sizes.
>
> There are lots of systems that work well on 10-node clusters. People select Hadoop
> because they are confident that as their business / problem grows, Hadoop can grow with it.

Please note that I did not say that Hadoop should not scale.
I know that winning Sorting contests is a great achievement and a huge
selling point.

I'm thinking along the lines of: How much scalability would the
majority of users be willing to trade for
a. more active committers (guess: 0%)
b. more regular releases
c. more non-scalability features (hot standby NN, security, you name it)

I myself, as a low-scale user, *would* trade a few percent for b. and c.

Thanks,

  Bernd

> ---
> E14 - via iPhone
>
> On Feb 17, 2011, at 7:25 AM, "Bernd Fondermann" <bernd.fondermann@googlemail.com> wrote:
>
>> On Thu, Feb 17, 2011 at 14:58, Ian Holsman <hadoop@holsman.net> wrote:
>>> Hi Bernd.
>>>
>>> On Feb 17, 2011, at 7:43 AM, Bernd Fondermann wrote:
>>>>
>>>> We have the very unfortunate situation here at Hadoop where Apache
>>>> Hadoop is not the primary and foremost place of Hadoop development.
>>>> Instead, code is developed internally at Yahoo and then contributed in
>>>> (smaller or larger) chunks to Hadoop.
>>>
>>> This has been the situation in the past,
>>> but as you can see in the last month, this has changed.
>>>
>>> Yahoo! has publicly committed to moving their development into the main code base,
>>> and you can see they have started doing this with the 20.100 branch,
>>> and their recent commits to trunk.
>>> Combine this with Nige taking on the 0.22 release branch (and shepherding it
>>> into a stable release), and I think we are addressing your concerns.
>>>
>>> They have also started bringing the discussions back to the list; see the recent
>>> discussion about JobTracker next-gen that Arun has re-started in MAPREDUCE-279.
>>>
>>> I'm not saying it's perfect, but I think the major players understand there is
>>> an issue, and they are *ALL* moving in the right direction.
>>
>> I would enthusiastically like to see your optimism verified.
>> Maybe I'm misreading the statements issued publicly, but I don't think
>> that this is fully understood. I agree, though, that it's a move in
>> the right direction.
>>
>>>> This is open source development upside down.
>>>> It is not OK for people to diff ASF svn against their internal code
>>>> and provide the diff as a patch without first reviewing the IP of every
>>>> line of code changed.
>>>> For larger chunks I'd suggest even going via the Incubator IP clearance process.
>>>> Only then will we force committers to primarily work here in the open
>>>> and return to what I'd consider a healthy project.
>>>>
>>>> To be honest: Hadoop is in the process of falling apart.
>>>> Contrib code gets moved out of Apache instead of being maintained here.
>>>> Discussions are seldom consensus-driven.
>>>> Release branches stagnate.
>>>
>>> True. Releases do take a long time. This is mainly because it is extremely
>>> hard to test and verify that a release is stable.
>>> It's not enough to just run the thing on 4 machines; you need at least 50 to
>>> test some of the major problems. This requires serious $ for someone to verify.
>>
>> It has been proposed on the list before, IIRC. Don't know how to get
>> there, but the project seriously needs access to a cluster of this
>> size.
>>
>>>> Downstream projects like HBase don't get proper support.
>>>> Production setups are made from 3rd party distributions.
>>>> Development is not happening here, but elsewhere behind corporate doors.
>>>> Discussions about future developments are started on corporate blogs (
>>>> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>>>> ) instead of on the proper mailing list.
>>>> Hurdles for committing are way too high.
>>>> On the bright side, new committers and PMC members are added, this is
>>>> an improvement.
>>>>
>>>> I'd suggest moving away from relying on large code dumps from
>>>> corporations, and moving back to the ASF-proven "individual committer
>>>> commits on trunk" model where more committers can get involved.
>>>> If that means not to support high end cluster sizes for some months,
>>>> well, so be it.
>>>
>>>> Average committers cannot run - e.g. test - on high-end
>>>> cluster sizes. If that means they cannot participate, then
>>>> the open source project had better concentrate on small and medium-sized
>>>> clusters instead.
>>>
>>>
>>> Well... that's one approach... but there are several companies out there who rely
>>> on Apache's Hadoop to power their large clusters, so I'd hate to see Hadoop become
>>> something that only runs well on 10 nodes, as I don't think that will help anyone either.
>>
>> But only looking at high-end scale doesn't help either.
>>
>> Let's face the fact that Hadoop is now moving from the early adopters phase
>> into a much broader market. I predict that small to medium-sized
>> clusters will be the majority of Hadoop deployments in a few months'
>> time. 4000, or even 500, machines is the high-end range. If the open
>> source project Hadoop cannot support those users adequately (without
>> becoming defunct), the committership might be better off focusing on
>> the low-end and medium-sized users.
>>
>> I'm not suggesting we turn away from the handful (?) of high-end
>> users. They certainly provide most valuable input. But *they*
>> obviously have the resources, in terms of larger clusters and
>> developers, to deal with their specific setups. Obviously, they don't
>> need to rely on the open source project to make releases. In fact,
>> they *do* work on their own Hadoop derivatives.
>> All the other users, the hundreds of boring small cluster users, don't
>> have that choice. They *depend* on the open source releases.
>>
>> Hadoop is an Apache project, meant to provide HDFS and MR free of charge to
>> the general public. Not only to me, and not to only one or two big
>> companies either.
>> Focus on all the users.
>>
>>  Bernd
>
