accumulo-dev mailing list archives

From Eric Whyne <ericwh...@gmail.com>
Subject Re: Code import for Apache Accumulo
Date Tue, 29 Oct 2013 20:20:12 GMT
Some thoughts to re-ignite this thread:

The raccumulo project has some of its code written in the R language, but
it does not borrow any code from the R codebase and as such is not a
derivative work.

Unless anybody can think of a way in which R's own licensing could become a
concern, potential license conflicts might be a dead issue?

Some background:
R is a domain-specific language for statistics, used mostly for statistics
research.
http://www.r-bloggers.com/r-usage-skyrocketing-rexer-poll/
"*R <http://www.revolutionanalytics.com/what-is-open-source-r/> is the most
popular data mining tool*, used at least occasionally by 70% of those
polled. This popularity holds amongst all of the subgroups in the survey as
well: R remains the most-used tool amongst corporate data miners (70%),
consulting data miners (73%), academic data miners (75%) and
nonprofit/NGO/government data miners (67%). And while the average data
miner reports using five software tools, R is also the most popular primary
tool in the survey, at 24% overall. "

The raccumulo code base was written for a defense customer, but has since
had investment from several DARPA programs and DHS because of the
importance of both Accumulo and R. They practically go together like peanut
butter and jelly (I just made that up). Projects analogous to raccumulo
exist for HBase (rhbase).

The primary developer, Phil Grim, has signed an ICLA that I'm going to send
off tomorrow pending approval from our company's contracts department. The
same goes for the company-level CCLA, which is complete and pending final
review. Phil, Aaron, and I are listed as representatives on it.

Regarding the observations about lack of committership:
Phil has been willing to share his code for a while and wants to keep
contributing.
https://issues.apache.org/jira/browse/SQOOP-767
https://issues.apache.org/jira/browse/ACCUMULO-141
discussion about this topic here:
http://www.mail-archive.com/notifications@accumulo.apache.org/msg10665.html

The other developer, Aaron, is listed as a previous contributor to Accumulo:
http://accumulo.apache.org/people.html

More about what's going on at the company:
https://twitter.com/DataTactics

More about DARPA XData (one of the programs of interest):
http://www.darpa.mil/Our_Work/I2O/Programs/XDATA.aspx
The customer project includes a charter to contribute to open source:
"XDATA plans to release open-source software toolkits to enable
collaboration among the applied mathematics, computer science and data
visualization communities."

As a company we'd be happy to just keep hosting the code on our GitHub
page, but I think we'd rather see it included closer to the Accumulo
project, as mentioned previously. Given the momentum of R and the interest
of DARPA and others, I think the benefits outweigh the risks. There's an
extremely small chance of an orphaned project, and even then, as a 200+
person company, there's somebody you can blame if it does become a problem.
We have a Twitter account and a GitHub page people can go to with help
requests or fixes.

We are interested in hearing more about how best to continue. I'll send a
note when the CCLA and ICLAs are fully executed.



On Tue, Oct 29, 2013 at 5:52 AM, Steve Loughran <stevel@hortonworks.com> wrote:

> On 29 October 2013 00:02, Christopher <ctubbsii@apache.org> wrote:
>
> > +1 for its own repo... but due to licensing concerns of the R
> > dependency, and lack of committership of the original developers, I'm
> > not sure it makes much sense for Accumulo to adopt it as a sub-project
> > by importing it, which would mean taking on the responsibility of
> > maintaining it.
> >
>
> That's the eternal problem with contributed code. Close to the project: you
> can keep an eye on it, but then people expect it to work and blame you if
> it can't. But at the same time, those contributions build up your project's
> functionality.
>
> One rule that I've found works is: never accept code that you can't test
> yourself.
>
> If it needs some non-standard filesystem, lots of pre-installed binaries or
> human intervention, it's not something that you can hook up to a CI build,
> or test in a release process - so it will be broken almost from the outset.
>
> If you can test it yourself, even if you have to pay a few cents of S3
> or openstack cluster time, then it is something you could consider
> releasing as "tested". Otherwise, it'll just become a maintenance and
> support nightmare in years to come.
>
> In Hadoop core some of the contribs/ -the schedulers - were pulled in, but
> other contrib stuff is now out -the general policy being "no orphaned works
> in the core codebase".
>
