Reply-To: dev@accumulo.apache.org
Message-ID: <55E21C74.2000900@apache.org>
Date: Sat, 29 Aug 2015 16:56:20 -0400
From: Josh Elser <elserj@apache.org>
To: user@accumulo.apache.org
CC: Accumulo Dev List <dev@accumulo.apache.org>
Subject: Re: [Accumulo Contrib Proposal] Graphulo: Server-side Matrix Math
 library
References: 
 <CAPx=JkYGk24+sabU01w8pKkChSW-shDHx=tQ6QkC9uBym0-MwA@mail.gmail.com>
In-Reply-To: 
 <CAPx=JkYGk24+sabU01w8pKkChSW-shDHx=tQ6QkC9uBym0-MwA@mail.gmail.com>

I've been thinking about this from a mindset of "what would the
Incubator think" of this project, focusing in particular on the
"community over code" precedent that the Incubator tries to instill.

Given what I see, it seems like Dylan is the sole contributor to 
Graphulo at this point (given his remark about owning 99% of the IP). As 
such, there is no other community around this code at this point. While
the desire seems to be to tie it more closely to Accumulo (via
contrib/sub-project/whatever), that alone isn't going to make it a
community. That's something that happens through blood, sweat and tears.
Having it in Apache doesn't cause any magic to happen; a community can
form just the same on Github or any other software hosting forum.

<pmc hat>
From the Accumulo guidance perspective, I don't think we have the
processes and guidelines in place to even consider fostering a new
project in addition to Accumulo itself. We have a hard enough time
keeping Accumulo active and responsive (see how many people are
regularly active given the number of committers we have). Before we'd
accept any new community, I think we actually need to define processes
to manage and grow such communities so we avoid another code
dump-and-run into contrib/.
</pmc hat>

I _do_ think we can engage with Dylan and try to give more visibility 
into Graphulo from the Accumulo project level: include content on 
accumulo.a.o about Graphulo, work on sharing more code between the 
projects (aforementioned iterators), posts on blogs.a.o, etc. I'd like 
to think that we would extend the same effort to projects like GeoMesa 
or GeoWave should they ask.

In short, I don't feel like Accumulo is in a place to accept and guide 
new projects, nor do I think Graphulo is in a place to benefit from 
moving to a subproject/contrib (or, at a minimum, moving it into Apache 
provides nothing that it can't already do).

Dylan Hutchison wrote:
> Dear Accumulo community,
>
> I humbly ask your consideration of Graphulo
> <https://github.com/Accla/graphulo> as a new contrib project to
> Accumulo.  Let's use this thread to discuss what Graphulo is, how it
> fits into the Accumulo community, where we can take it together as a new
> community, and how you can use it right now.  Please see the README at
> Graphulo's Github, and for a more in-depth look see the docs/ folder or
> the examples.
>
>     https://github.com/Accla/graphulo
>
>     Graphulo is a Java library for the Apache Accumulo database
>     delivering server-side sparse matrix math primitives that enable
>     higher-level graph algorithms and analytics.
>
> Pitch: Organizations use Accumulo for high performance indexed and
> distributed data storage.  What do they do after their data is stored?
> Many use cases perform analytics and algorithms on data in Accumulo,
> which, aside from simple iterator uses, require scanning data out from
> Accumulo to a computation engine, only to write computation results back
> to Accumulo.  Graphulo enables a class of algorithms to run inside the
> Accumulo server like a stored procedure, especially (but not restricted
> to) those written in the language of graphs and linear algebra.  Take
> breadth first search as a simple use case and PageRank as one more
> complex.  As a stretch goal, imagine analysts and mathematicians
> executing PageRank and other high level algorithms on top of the
> Graphulo library on top of Accumulo at high performance.
>
> I have developed Graphulo at the MIT Lincoln Laboratory with support
> from the NSF since last March.  I owe thanks to Jeremy Kepner, Vijay
> Gadepally, and Adam Fuchs for high level comments during design and
> performance testing phases.  I proposed a now-obsolete design document
> last spring to the Accumulo community too, which received good feedback.
>
> The time is ripe for Graphulo to graduate from my personal development into
> larger participation.  Beyond myself and beyond the Lincoln Laboratory,
> Graphulo is for the Accumulo community.  Users need a place where they
> can interact, developers need a place where they can look, comment, and
> debate designs and diffs, and both users and developers need a place
> where they can interact and see Graphulo alongside its Accumulo base.
>
> The following outlines a few reasons why I see contrib to Accumulo as
> Graphulo's proper place:
>
>  1. Establishing Graphulo as an Apache (sub-)project is a first step
>     toward building a community.  The spirit of Apache--its mailing list
>     discussions, low barrier to interactions between users and
>     developers new and old, open meritocracy and more--is a surefire way
>     to bring Graphulo to the people it will help and the people who want
>     to help it in turn.
>
>  2. Committing to core Accumulo doesn't seem appropriate for all of
>     Graphulo, because Graphulo uses Accumulo in a specific way
>     (server-side computation) in support of algorithms and
>     applications.  Parts of Graphulo that are useful for all Accumulo
>     users (not just matrix math for algorithms) could be transferred
>     from Graphulo to Accumulo, such as ApplyIterator or
>     SmallLargeRowFilter or DynamicIterator.
>
>  3. Leaving Graphulo as an external project leaves Graphulo too
>     decoupled from Accumulo.  Graphulo has potential to drive features
>     in core Accumulo such as ACCUMULO-3978
>     <https://issues.apache.org/jira/browse/ACCUMULO-3978>,
>     ACCUMULO-3710 <https://issues.apache.org/jira/browse/ACCUMULO-3710>,
>     and ACCUMULO-3751
>     <https://issues.apache.org/jira/browse/ACCUMULO-3751>.  By making
>     Graphulo a contrib sub-project, the two can maintain a tight
>     relationship while still maintaining independent versions.
>
> Historically, contrib projects have gone into Accumulo contrib and
> become stale.  I assure you I do not intend this fate for Graphulo.  The
> Lincoln Laboratory has interests in Graphulo, and I will continue
> developing Graphulo at the very least to help transition Graphulo to
> greater community involvement.  However, since I will start a PhD
> program at UW next month, I cannot make Graphulo a full time job as I
> have in recent history.  I do have ideas for using Graphulo as part of
> my PhD database research.
>
> Thus, in the case of large community support, I can transition to a
> support role while others in the community step up.  With smaller
> community support, I can continue working on Graphulo as before at my
> own pace and perhaps more publicly.  There are only a few steps left
> before Graphulo could be considered "finished software":
>
>   * Developing a new interface to Graphulo's core functions using
>     immutable argument objects, which simplifies developer APIs,
>     increases generalizability, and facilitates features like
>     asynchronous and parallel operations.  It would be good if other
>     developers weigh in on designs as we propose them, since
>     this decides how Graphulo users interact with Graphulo.
>
>   * Instrumenting Graphulo for monitoring, profiling and benchmarking.
>     I have a blueprint on how to use HTrace to make these tasks as easy
>     as browsing a web page.  Needs careful thought and discussion before
>     implementing, since this instrumentation will go everywhere.  It
>     would be nice if Graphulo and Accumulo mirror instrumentation
>     strategies, so it would be good to have that discussion in the same
>     venue.
>
>   * Rigorous *scale testing*.  Good instrumentation is key.  With
>     successful scale testing, we paint a clear picture for potential
>     adopters of which operations Graphulo excels at, ultimately
>     plotting where Graphulo stands in the world of big data software.
>
>   * Explicitly supporting the GraphBLAS <http://graphblas.org/> spec,
>     once it is agreed upon.  Graphulo was designed from the ground up
>     with the GraphBLAS in mind, so this should be an easy task.
>     Aligning with this upcoming industry standard bodes well for ease of
>     developing Graphulo algorithms.
>
> Developing more algorithms and applications will follow too, and I
> imagine this as an excellent place where newcomer users can get involved.
>
> Some other places Graphulo needs work worth mentioning are creating a
> proper release framework (the release here
> <https://github.com/Accla/graphulo/releases> could use improvement,
> starting with signed artifacts) and reviewing the way Graphulo runs
> tests (currently centered around a critical file called TEST_CONFIG.java
> which is great for one developer, whereas a config file probably works
> better).  Both of these are places more experienced developers could
> help.  I should also mention that Graphulo has groundwork in place for
> operations between Accumulo instances, but I doubt many users would need
> that level of control.
>
> Regarding IP, I'm happy to donate my commits to the ASF, which covers
> 99% of the Graphulo code base.  I'm sure other issues will arise and we
> can sort them out.  Sean Busbey, perhaps I could ask your assistance as
> someone more knowledgeable in this area.  Regarding
> dependencies, effectively every direct dependency is org.apache, so
> nothing to worry about here.
>
> I acknowledge that I will lose dictatorial power and gain some
> bureaucratic / discussion overhead by moving from sole developer to an
> Apache model.  The benefits of a community are well worth it.
>
> If we as a community decide that contrib is the right place for
> Graphulo, then there are lots of logistical questions to decide like
> where the code will live, where JIRA will live, what mailing lists to
> use, what URL to give Graphulo within apache.org <http://apache.org>,
> etc.  We can tackle these at our leisure.  Let's discuss Graphulo and
> Accumulo here first.
>
> Warmly,
> Dylan Hutchison