hadoop-general mailing list archives

From Konstantin Boudnik <...@yahoo-inc.com>
Subject Re: Subprojects and TLP status
Date Wed, 14 Apr 2010 16:19:51 GMT
Great summing up of the Hadoop business as it is today.


On Tue, Apr 13, 2010 at 11:46 PM, Chris Douglas wrote:
> Most of Hadoop's subprojects have discussed becoming top-level Apache
> projects (TLPs) in the last few weeks. Most have expressed a desire to
> remain in Hadoop. The salient parts of the discussions I've read tend
> to focus on three aspects: a technical dependence on Hadoop,
> additional overhead as a TLP, and visibility both within the Hadoop
> ecosystem and in the open source community generally.
> Life as a TLP: this is not much harder than being a Hadoop subproject,
> and the Apache preferences being tossed around- particularly
> "insufficiently diverse"- are not blockers. Every subproject needs to
> write a section of the report Hadoop sends to the board; almost the
> same report, sent to a new address. The initial cost is similarly
> light: copy bylaws, send a few notes to INFRA, and follow some
> directions. I think the estimated costs are far higher than they will
> be in practice. Inertia is a powerful force, but it should be
> overcome. The directions are here, and should not be intimidating:
> http://apache.org/dev/project-creation.html
> Visibility: the Hadoop site does not need to change. For each
> subproject, we can literally change the hyperlinks to point to the new
> page and be done. Long-term, linking to all ASF projects that run on
> Hadoop from a prominent page is something we all want. So particularly
> in the medium-term that most are considering: visibility through the
> website will not change. Each subproject will still be linked from the
> front page.
> Hadoop would not be nearly as popular as it is without Zookeeper,
> HBase, Hive, and Pig. All statistics on work in shared MapReduce
> clusters show that users vastly prefer running Pig and Hive queries to
> writing MapReduce jobs. HBase continues to push features in HDFS that
> increase its adoption and relevance outside MapReduce, while sharing
> some of its NoSQL limelight. Zookeeper is not only a linchpin in real
> workloads, but many proposals for future features require it. The
> bottom line is that MapReduce and HDFS need these projects for
> visibility and adoption in precisely the same way. I don't think
> separate TLPs will uncouple the broader community from one another.
> Technical dependence: this has two dimensions. First, influencing
> MapReduce and HDFS. This is nonsense. Earning influence by
> contributing to a subproject is the only way to push code changes;
> nobody from any of these projects has violated that by unilaterally
> committing to HDFS or MapReduce, anyway. And anyone cynical enough to
> believe that MapReduce and HDFS would deliberately screw over or
> ignore dependent projects because they don't have PMC members is
> plainly unsuited to community-driven development. I understand that
> these projects need to protect their users, but lobbying rights are
> not an actual benefit.
> Second, being a coherent part of the Hadoop ecosystem. It is (mostly)
> true that Hadoop currently offers a set of mutually compatible
> frameworks. It is not true that moving them to separate Apache
> projects would make solutions less coherent or affect existing or
> future users at all. The cohesion between projects' governance is
> sufficiently weak to justify independent units, but the real
> dependencies between the projects are strong enough to keep us engaged
> with one another. And it's not as if other projects- Cascading, for
> example- aren't also organisms adapted and specialized for life in
> Hadoop.
> Arguments on technical dependence are ignoring the nature of the
> existing interactions. Besides, weak technical dependencies are not a
> necessary prerequisite for a subproject's independence.
> As for what was *not* said in these discussions, there is no argument
> that every one of these subprojects has a distinct, autonomous
> community. There was also no argument that the Hadoop PMC offers any
> valuable oversight, given that the representatives of its fiefdoms are
> too consumed by provincial matters to participate in neighboring
> governance. Most releases I've voted on: I run the unit tests, check
> the signature, verify the checksum, and know literally nothing else
> about its content. I have often never heard the names of many proposed
> committers and even some proposed PMC members. Right now, subprojects
> with enough PMC members essentially vote out their own releases and
> vote in their own committers: TLPs in all but name.
> The Hadoop club- in conferences, meetups, technical debates, etc.- is
> broad, diverse, and intertwined, but communities of developers have
> already clustered around subprojects. Allowing that each cluster
> should govern itself is a dry, practical matter, not an existential
> crisis. -C
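The release checks Chris describes (verify the signature, verify the checksum) can be sketched roughly as below. This is a minimal illustration, not the official Apache verification procedure; the artifact name is a hypothetical stand-in, and a real release check would also fetch the project's KEYS file and verify the detached GPG signature with `gpg --verify`.

```shell
# Stand-in for a downloaded release artifact (hypothetical filename).
printf 'release contents' > hadoop-0.20.2.tar.gz

# A release publishes a checksum file alongside the artifact;
# we simulate one here, then verify it in check mode.
sha1sum hadoop-0.20.2.tar.gz > hadoop-0.20.2.tar.gz.sha1
sha1sum -c hadoop-0.20.2.tar.gz.sha1 && echo "checksum OK"

# For the signature, a real check looks like (not runnable here,
# since it needs the actual .asc file and the signer's public key):
#   gpg --verify hadoop-0.20.2.tar.gz.asc hadoop-0.20.2.tar.gz
```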
