hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: [DISCUSSION] development process of Hadoop
Date Thu, 05 May 2011 11:35:21 GMT
On 05/05/11 10:51, Tony Valderrama wrote:
> Hi, I just wanted to drop in a few thoughts from a new developer
> working outside of the Hadoop developer community.
> On Wed, May 4, 2011 at 7:39 PM, Eric Yang<eyang@yahoo-inc.com>  wrote:
>> While the world demand agility, the "review then commit" process is preventing progress
>> from happening.  People end up having to generate multiple version of patches to
>> the code can be applied.  The large lag time between patch generation and reviewed
>> is taking significant toll on the community and progress.
>> Yahoo have a great team of developers who improves Hadoop at faster pace with its
>> fork of the source code.  The reason that Yahoo was able to achieve faster improvement
>> features was due to the ability to use source code repository tools properly.  Unfortunate
>> for Yahoo, their source code repository was not Apache svn trunk.
> I agree that the review process is broken.  However, the current
> situation is exactly the result of a lack of adherence to this and
> other processes.  Various subgroups within the community have
> (intentionally or unintentionally) hijacked the project at different
> times by avoiding community processes in the interest of agility or
> commercial benefit, and the result is a highly fragmented project with
> no clear direction.
> From the outside, Hadoop looks like a Yahoo/Cloudera project which
> occasionally gets an Apache stamp.  Given the lack of adherence to
> processes, as a non-Yahoo/Cloudera developer I have no way of breaking
> into the development community.  Who's going to review or commit
> patches I submit?  And which of the myriad versions should I even be
> trying to patch against?  And given the speed with which undocumented
> changes are being made, how am I supposed to figure out if my changes
> are going to be relevant or viable next week?  We'd love to contribute
> back, but it's just not clear that we or other small players have any
> place within the Hadoop developer community.

As someone who has commit rights but undercommits, here are my issues:
  -I am not full time on Hadoop; I have little time to keep my own code 
up to date, let alone review patches
  -I am not fully up to date with all the changes or subtleties in what 
is a big, complicated system
  -I don't want to break the big systems (Y!, Facebook) by introducing 
changes that work on my network and my (small, dynamic) clusters but 
which place limitations on scale. It's why I prefer review by those 
people who do work on large scale projects.

>> Use JIRA, if there is large feature set that requires brain storming, and developers
>> should have the ability to make small incremental changes without RTC.  This will
>> ensure developers help each other rather than policing each other.
> As an outsider, JIRA is the only way I've been able to follow the
> changes to Hadoop's code and guess where the project is heading.
> Permitting developers to commit without review or documentation will
> just further exclude anyone who can't walk down the hall and knock on
> an office door to ask about a commit.

I've worked in other ASF projects (Axis) where some large dev teams 
(IBM) used to make decisions in team meetings and propagate them. It's 
faster, but less community-centric, and when a large dev team (IBM) gets 
re-assigned internally, everyone is left scrambling not just to catch up 
engineering-wise, but also to make sense of big chunks of 
under-documented code. At least the JIRA-based review process provides 
a discussion log, and Hudson/Jenkins checks that there are tests, 
no extra warnings, etc.

What could be interesting would be
  -a move to Git to make it easier to pull in patches from other 
branches, and for people like Tony to have their own fork under SCM.
  -adoption of Gerrit for having each JIRA issue move from being a patch 
to a branch (local or remote), so that people can develop the code for 
an issue, others can pull it in and merge it, and so that the issue 
tracks live code, not dead patches
  -more testing of trunk in bigger real/virtual clusters

I don't know how we can do this; I'd love to hear about experiences 
others have with such a process.
