hadoop-general mailing list archives

From Eric Yang <ey...@yahoo-inc.com>
Subject Re: [DISCUSSION] development process of Hadoop
Date Thu, 05 May 2011 17:32:10 GMT
Git is powerful for maintaining different branches of the source code.  However, it will only
work if the entire community is willing to move to git.  Maintaining an svn and git hybrid is
a time-consuming task for which we are paying full price.  The Hadoop community should work smarter
on source control.  What do people think about fully adopting git instead of svn?

Regards,
Eric

On 5/5/11 4:35 AM, "Steve Loughran" <stevel@apache.org> wrote:

On 05/05/11 10:51, Tony Valderrama wrote:
> Hi, I just wanted to drop in a few thoughts from a new developer
> working outside of the Hadoop developer community.
>
> On Wed, May 4, 2011 at 7:39 PM, Eric Yang<eyang@yahoo-inc.com>  wrote:
>> While the world demand agility, the "review then commit" process is preventing progress
>> from happening.  People end up having to generate multiple version of patches to ensure
>> the code can be applied.  The large lag time between patch generation and reviewed
>> is taking significant toll on the community and progress.
>
>> Yahoo have a great team of developers who improves Hadoop at faster pace with its own
>> fork of the source code.  The reason that Yahoo was able to achieve faster improvement with
>> features was due to the ability to use source code repository tools properly.  Unfortunate
>> for Yahoo, their source code repository was not Apache svn trunk.
>
> I agree that the review process is broken.  However, the current
> situation is exactly the result of a lack of adherence to this and
> other processes.  Various subgroups within the community have
> (intentionally or unintentionally) hijacked the project at different
> times by avoiding community processes in the interest of agility or
> commercial benefit, and the result is a highly fragmented project with
> no clear direction.
>
> From the outside, Hadoop looks like a Yahoo/Cloudera project which
> occasionally gets an Apache stamp.  Given the lack of adherence to
> processes, as a non-Yahoo/Cloudera developer I have no way of breaking
> into the development community.  Who's going to review or commit
> patches I submit?  And which of the myriad versions should I even be
> trying to patch against?  And given the speed with which undocumented
> changes are being made, how am I supposed to figure out if my changes
> are going to be relevant or viable next week?  We'd love to contribute
> back, but it's just not clear that we or other small players have any
> place within the Hadoop developer community.

As someone who has commit rights but undercommits, here are my issues
  -I am not full time on hadoop, I have little time to keep my own code
up to date, let alone review patches
  -I am not fully up to date with all the changes or subtleties in what
is a big, complicated system
  -I don't want to break the big systems (Y!, Facebook) by introducing
changes that work on my network and my (small, dynamic) clusters but
which place limitations on scale. It's why I prefer review by those
people who do work on large scale projects.

>
>> Use JIRA, if there is large feature set that requires brain storming, and developers
>> should have the ability to make small incremental changes without RTC.  This will ensure
>> developers help each other rather than policing each other.
>
> As an outsider, JIRA is the only way I've been able to follow the
> changes to Hadoop's code and guess where the project is heading.
> Permitting developers to commit without review or documentation will
> just further exclude anyone who can't walk down the hall and knock on
> an office door to ask about a commit.

I've worked in other ASF projects (Axis) where some large dev teams
(IBM) used to make decisions in team meetings and propagate them. It's
faster, but less community centric, and when a large dev team (IBM) gets
re-assigned internally everyone is left not just scrambling to catch up
engineering-wise, but also to make sense of big chunks of
under-documented code. At least the JIRA-based review process
provides a discussion log, and Hudson/Jenkins checks that there are tests,
no extra warnings, etc.

What could be interesting would be
  -a move to Git to make it easier to pull in patches from other
branches, and for people like Tony to have their own fork under SCM.
  -adoption of Gerrit for having each JIRA issue move from being a patch
to a branch (local or remote), so that people can develop the code for
an issue, others can pull it in and merge it, and so that the issue
tracks live code, not dead patches
  -more testing of trunk in bigger real/virtual clusters

I don't know how we can do this; I'd love to hear about experiences
others have had with such a process.


