hadoop-general mailing list archives

From Steve Loughran <ste...@apache.org>
Subject Re: LimitedPrivate and HBase (thoughts from the build and test world)
Date Thu, 09 Jun 2011 11:42:21 GMT
On 06/08/2011 06:41 PM, Suresh Srinivas wrote:
> I do not see any issue with the change that Todd has made. We have done
> similar changes in HDFS-1586 in the past.
>
> Making APIs public comes with a cost. That is what we are avoiding with
> LimitedPrivate. The intention was to include the following projects that are
> closely tied to Hadoop as projects eligible for LimitedPrivate.
> {"HBase", "HDFS", "Hive", "MapReduce", "Pig"}. This list could grow in the
> future.

I'm going to talk about my experience on the Ant team.

One of the lessons of that project is that in the open source world, you 
can't predict how your code gets used, or control it. If someone wants 
to take your app and use it as a library, they can. If someone wants to 
do something completely unexpected with that library, they can. And this 
is a good thing, because your code gets used. Yes, you get new bug 
reports, but every person using your code is someone not using somebody 
else's code. You win.

The other lesson from that is the following: in open source, there is no 
such thing as private code.

* If you mark something as package scoped, they just inject their 
classes into your package (and who hasn't done that with their Hadoop 
extensions?).
* If you mark something as protected, they subclass it and open up its 
privacy.
* If you mark something as private, they edit your source and create a 
new JAR with the relaxed permissions.

In any of these cases, you end up fielding the bug reports, as the stack 
trace points to you. And it increases maintenance costs for everyone.
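The first two workarounds are easy to sketch in plain Java. Every class and method name below is invented for illustration (none of this is Hadoop or Ant code); both "library" and "user" classes sit in the same (default) package, which is exactly the package-injection trick described above:

```java
// --- imagine this class ships inside the library's JAR ---
class LibraryWidget {
    // package-private: "not part of the public API"
    static String internalId() { return "widget-42"; }
    // protected: "only for subclasses"
    protected String secret() { return "s3cret"; }
}

// Workaround 1: declare your class in the *same package* as the
// library class, so package-private members become reachable.
class PackageInjector {
    static String readInternal() {
        return LibraryWidget.internalId(); // legal: same package
    }
}

// Workaround 2: subclass the library class and expose its
// protected member through a public method of your own.
class OpenedWidget extends LibraryWidget {
    public String exposeSecret() {
        return secret(); // protected -> effectively public
    }
}

public class AccessDemo {
    public static void main(String[] args) {
        System.out.println(PackageInjector.readInternal()); // prints "widget-42"
        System.out.println(new OpenedWidget().exposeSecret()); // prints "s3cret"
    }
}
```

Neither workaround requires touching the library's source; the third (editing the source and rebuilding the JAR) needs no example at all.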


Alternatively they cut and paste your code into their codebase, possibly, 
but not always, retaining the Apache credits.

That:
  * complicates copyright and lawsuits:
  http://www.theserverside.com/news/thread.tss?thread_id=29958

  * increases maintenance costs for everyone, especially if there are 
security issues with the original code.

> When such projects break because of API change, we can co-ordinate as
> community and fix the issues. This is not true for some application that we
> do not know of breaks!

The way Ant handled this was with Gump, the nightly clean build of all 
the OSS Java projects built with Ant:
http://vmgump.apache.org/gump/public/

All of those projects thought they were getting a free CI build, but 
what it really was was a regression test of Ant and every single OSS 
project. If a change in Ant broke anyone's build, we noticed. If a 
change in Log4J broke a build, someone noticed. It became a 
rapid-response regression test for the entire OSS suite.

Sadly, it doesn't work so well any more. I'd blame Maven, but the move 
to Ivy dependencies doesn't help either; it complicates classpaths no end.

Even so, the idea is great: build and test your downstream applications, 
and the things you depend on, so you find problems within 24 hours of 
the change being committed -regardless of which project committed the 
change.

The way to do it now would be with Jenkins, not just building and 
testing Hadoop-{core, hdfs, mapreduce}, but:
  - building and publishing every upstream dependency.
  - testing against the trunk versions built locally.
  - building and testing against the ivy-versioned artifacts that are 
controlled by the version.properties file.

Together this flags up when something works against the old artifacts 
but doesn't work against the trunk versions: those are regressions, 
caught early.

Downstream:
  - build and test the OSS projects that work with Hadoop. That's the 
Apache ones: HBase, Mahout, Pig, Hive, Hama etc., and the other ones, 
such as Cascading.

That can be offered as a service to these projects: "we will build and 
test your code against our trunk", a service designed to benefit 
everyone. They find their bugs; we find regressions.

This is a pretty complex project, especially when you consider the 
challenge of testing that your RPM generation code produces RPMs that 
actually install (I bring up clean CentOS VMs for that purpose), but 
without it you don't get everything working together, which is the 
state things appear to be in today.

Ignoring the RPM install & test problems, if people are interested in 
working on this, we should be able to do a lot of it on Jenkins. Who is 
willing to get involved?

-Steve
