accumulo-user mailing list archives

From dlmar...@comcast.net
Subject Re: [Accumulo Contrib Proposal] Graphulo: Server-side Matrix Math library
Date Fri, 28 Aug 2015 16:48:24 GMT
[adding dev@ back in, I forgot to reply-all last time] 

I think the process and exit criteria would be different for a code contribution vs a sub-project.
[1] talks about projects and sub-projects, [2] talks about contributions. I don't know if
the exit criteria for a sub-project are the same as for a top-level project: will you be required
to show community growth, an understanding of the process, infrastructure setup, etc.? If
so, who is going to shepherd Graphulo through this process? I'm not an expert in this area.
I just wanted to point out that these are likely hurdles of different heights.

[1] http://incubator.apache.org/incubation/Incubation_Policy.html
[2] http://incubator.apache.org/ip-clearance/ 

----- Original Message -----

From: "Dylan Hutchison" <dhutchis@uw.edu> 
To: "Accumulo User List" <user@accumulo.apache.org>, "Accumulo Dev List" <dev@accumulo.apache.org>

Sent: Friday, August 28, 2015 11:25:11 AM 
Subject: Re: [Accumulo Contrib Proposal] Graphulo: Server-side Matrix Math library 



place this in the contrib area or create a sub-project? 



Ah ha, I indeed had the two avenues mistakenly equated in my head since both involve Incubator
approval and the same proposal and IP template. 

I intend Graphulo as a sub-project of Accumulo. There are enough use cases unrelated to Accumulo's
core development (algorithms, Graphulo client, Graphulo-specific iterators) that it makes
sense to form a dedicated project for Graphulo. That said, Graphulo is coupled to Accumulo
by design and purpose, and there is large opportunity for synergy in that Graphulo development
may help Accumulo development and vice versa. We're in that happy middle spot where a sub-project
makes sense. That said, this is a community decision, and so I'm open to other opinions. 

Regards, Dylan 

On Fri, Aug 28, 2015 at 8:08 AM, dlmarion < dlmarion@comcast.net > wrote: 

<blockquote>

Dylan, 

I am a little confused about whether you want to place this in the contrib area or whether
you want to create a sub-project as both are mentioned in your proposal. Also, if you intend
for this to be a sub-project, have you looked at the incubator process? From what I understand,
given that this is a code contribution, it will have to go through that process.




-------- Original message -------- 
From: Dylan Hutchison < dhutchis@uw.edu > 
Date: 08/28/2015 2:43 AM (GMT-05:00) 
To: Accumulo Dev List < dev@accumulo.apache.org > 
Cc: Accumulo User List < user@accumulo.apache.org > 
Subject: [Accumulo Contrib Proposal] Graphulo: Server-side Matrix Math library 

Dear Accumulo community, 

I humbly ask your consideration of Graphulo as a new contrib project to Accumulo. Let's use
this thread to discuss what Graphulo is, how it fits into the Accumulo community, where we
can take it together as a new community, and how you can use it right now. Please see the
README at Graphulo's GitHub, and for a more in-depth look see the docs/ folder or the examples.



<blockquote>

https://github.com/Accla/graphulo 

</blockquote>


<blockquote>

Graphulo is a Java library for the Apache Accumulo database delivering server-side sparse
matrix math primitives that enable higher-level graph algorithms and analytics. 


</blockquote>

Pitch: Organizations use Accumulo for high performance indexed and distributed data storage.
What do they do after their data is stored? Many use cases perform analytics and algorithms
on data in Accumulo, which, aside from simple uses of iterators, require scanning data out from
Accumulo to a computation engine, only to write computation results back to Accumulo. Graphulo
enables a class of algorithms to run inside the Accumulo server like a stored procedure, especially
(but not only) those written in the language of graphs and linear algebra. Take breadth-first
search as a simple use case and PageRank as a more complex one. As a stretch goal, imagine
analysts and mathematicians executing PageRank and other high level algorithms on top of the
Graphulo library on top of Accumulo at high performance.
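
To make the breadth-first search case concrete, here is a minimal in-memory sketch of BFS phrased as repeated sparse matrix-vector products. It uses only plain Java collections; the class and method names (BfsSketch, step, bfs) are made up for illustration and are not Graphulo or Accumulo APIs. The point of the pitch is that the same kind of step could run inside the server instead of on data scanned out to a client.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: plain Java collections, not Graphulo or Accumulo APIs.
class BfsSketch {
    /**
     * One BFS step. In linear algebra terms this multiplies the frontier
     * vector by the sparse adjacency matrix over a boolean (OR, AND) semiring:
     * the result holds every node with an edge from some frontier node.
     */
    static Set<String> step(Map<String, Set<String>> adjacency, Set<String> frontier) {
        Set<String> next = new HashSet<>();
        for (String node : frontier) {
            // Row lookup in the sparse matrix; a missing row means no out-edges.
            next.addAll(adjacency.getOrDefault(node, Collections.<String>emptySet()));
        }
        return next;
    }

    /** Runs up to k steps from start and returns all nodes seen, including start. */
    static Set<String> bfs(Map<String, Set<String>> adjacency, String start, int k) {
        Set<String> seen = new HashSet<>(Collections.singleton(start));
        Set<String> frontier = Collections.singleton(start);
        for (int i = 0; i < k && !frontier.isEmpty(); i++) {
            Set<String> next = step(adjacency, frontier);
            next.removeAll(seen); // do not revisit nodes
            seen.addAll(next);
            frontier = next;
        }
        return seen;
    }
}
```

In the scan-out model, each call to step would pull matrix rows across the network to the client; the server-side model keeps that work next to the data.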

I have developed Graphulo at the MIT Lincoln Laboratory with support from the NSF since last
March. I owe thanks to Jeremy Kepner, Vijay Gadepally, and Adam Fuchs for high level comments
during design and performance testing phases. Last spring I also proposed a design document,
now obsolete, to the Accumulo community, and it received good feedback.

The time is ripe for Graphulo to graduate from my personal development into larger participation.
Beyond myself and beyond the Lincoln Laboratory, Graphulo is for the Accumulo community. Users
need a place where they can interact, developers need a place where they can look, comment,
and debate designs and diffs, and both users and developers need a place where they can interact
and see Graphulo alongside its Accumulo base. 

The following outlines a few reasons why I see contrib to Accumulo as Graphulo's proper place:



    1. Establishing Graphulo as an Apache (sub-)project is a first step toward building a
community. The spirit of Apache--its mailing list discussions, low barrier to interactions
between users and developers new and old, open meritocracy and more--is a surefire way to
bring Graphulo to the people it will help and the people who want to help it in turn. 

    2. Committing to core Accumulo doesn't seem appropriate for all of Graphulo, because Graphulo
uses Accumulo in a specific way (server-side computation) in support of algorithms and applications.
Parts of Graphulo that are useful for all Accumulo users (not just matrix math for algorithms)
could be transferred from Graphulo to Accumulo, such as ApplyIterator or SmallLargeRowFilter
or DynamicIterator. 

    3. Leaving Graphulo as an external project leaves Graphulo too decoupled from Accumulo.
Graphulo has potential to drive features in core Accumulo such as ACCUMULO-3978, ACCUMULO-3710,
and ACCUMULO-3751. By making Graphulo a contrib sub-project, the two can maintain a tight
relationship while still maintaining independent versions. 

Historically, contrib projects have gone into Accumulo contrib and become stale. I assure
you I do not intend this fate for Graphulo. The Lincoln Laboratory has an interest in Graphulo,
and I will continue developing Graphulo at the very least to help transition Graphulo to greater
community involvement. However, since I will start a PhD program at UW next month, I cannot
make Graphulo a full time job as I have in recent history. I do have ideas for using Graphulo
as part of my PhD database research. 

Thus, in the case of large community support, I can transition to a support role while others
in the community step up. If community support is smaller, I can continue working on Graphulo
as before at my own pace and perhaps more publicly. There are only a few steps left before
Graphulo could be considered "finished software": 


    * Developing a new interface to Graphulo's core functions using immutable argument objects,
which simplifies developer APIs, increases generalizability, and facilitates features like
asynchronous and parallel operations. It would be good if other developers weigh in
on designs as we propose them, since this decides how Graphulo users interact with Graphulo.


    * Instrumenting Graphulo for monitoring, profiling and benchmarking. I have a blueprint
on how to use HTrace to make these tasks as easy as browsing a web page. This needs careful thought
and discussion before implementing, since this instrumentation will go everywhere. It would
be nice if Graphulo and Accumulo mirror instrumentation strategies, so it would be good to
have that discussion in the same venue. 

    * Rigorous scale testing. Good instrumentation is key. With successful scale testing,
we can paint a clear picture for potential adopters of which operations Graphulo excels at,
ultimately plotting where Graphulo stands in the world of big data software.

    * Explicitly supporting the GraphBLAS spec, once it is agreed upon. Graphulo was designed
from the ground up with the GraphBLAS in mind, so this should be an easy task. Aligning with
this upcoming industry standard bodes well for ease of developing Graphulo algorithms. 
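
As a rough illustration of the "immutable argument objects" idea in the first item above, the following sketch shows one common Java shape for it: a final class with final fields, a builder, and "with" methods that return modified copies. Every name here (MultiplyArgs, leftTable, numThreads, and so on) is hypothetical and chosen only for the example; it is not Graphulo's actual or planned interface.

```java
import java.util.Objects;

// Hypothetical names throughout; this is not Graphulo's actual interface.
final class MultiplyArgs {
    private final String leftTable;
    private final String rightTable;
    private final String resultTable;
    private final int numThreads;

    private MultiplyArgs(Builder b) {
        // Fail fast on missing required arguments.
        this.leftTable = Objects.requireNonNull(b.leftTable, "leftTable");
        this.rightTable = Objects.requireNonNull(b.rightTable, "rightTable");
        this.resultTable = Objects.requireNonNull(b.resultTable, "resultTable");
        this.numThreads = b.numThreads;
    }

    String leftTable() { return leftTable; }
    String rightTable() { return rightTable; }
    String resultTable() { return resultTable; }
    int numThreads() { return numThreads; }

    /** Returns a modified copy; the original object is never changed. */
    MultiplyArgs withNumThreads(int n) {
        return new Builder(this).numThreads(n).build();
    }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        private String leftTable;
        private String rightTable;
        private String resultTable;
        private int numThreads = 1; // assumed default for the sketch

        private Builder() {}

        private Builder(MultiplyArgs a) {
            this.leftTable = a.leftTable;
            this.rightTable = a.rightTable;
            this.resultTable = a.resultTable;
            this.numThreads = a.numThreads;
        }

        Builder leftTable(String t) { this.leftTable = t; return this; }
        Builder rightTable(String t) { this.rightTable = t; return this; }
        Builder resultTable(String t) { this.resultTable = t; return this; }
        Builder numThreads(int n) { this.numThreads = n; return this; }

        MultiplyArgs build() { return new MultiplyArgs(this); }
    }
}
```

Because instances never change after construction, the same argument object can be shared safely between threads or reused across calls, which is what makes asynchronous and parallel operations easier to offer.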

Developing more algorithms and applications will follow too, and I imagine this as an excellent
place where newcomer users can get involved. 

Other areas where Graphulo needs work include creating a proper release framework
(the release here could use improvement, starting with signed artifacts) and reviewing the
way Graphulo runs tests (currently centered around a critical file called TEST_CONFIG.java
which is great for one developer, whereas a config file probably works better). Both of these
are places more experienced developers could help. I should also mention that Graphulo has
groundwork in place for operations between Accumulo instances, but I doubt many users would
need that level of control. 
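
For the test configuration point above, one possible direction is a small loader that reads connection settings from a properties file instead of a hand-edited Java source file. This is only a sketch under assumptions: the class name TestConfigSketch, the property keys, and the default values are invented for illustration and do not reflect how Graphulo's tests are set up today.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical class name, keys, and defaults; not Graphulo's actual test setup.
final class TestConfigSketch {
    final String instanceName;
    final String zookeepers;
    final String user;

    private TestConfigSketch(Properties p) {
        // Each developer overrides only what differs from the assumed defaults.
        this.instanceName = p.getProperty("accumulo.instance", "mini");
        this.zookeepers = p.getProperty("zookeepers", "localhost:2181");
        this.user = p.getProperty("user", "root");
    }

    /** Loads settings from any character source, such as a file or a string. */
    static TestConfigSketch load(Reader source) {
        Properties p = new Properties();
        try {
            p.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new TestConfigSketch(p);
    }
}
```

A developer could then keep a local properties file out of version control and pass it in with a java.io.FileReader, so that several developers can run the tests against different instances without editing shared source.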

Regarding IP, I'm happy to donate my commits to the ASF, which covers 99% of the Graphulo
code base. I'm sure other issues will arise and we can sort them out. Sean Busbey, perhaps
I could ask your assistance as someone more knowledgeable in this area. Regarding dependencies,
effectively every direct dependency is org.apache, so nothing to worry about here. 

I acknowledge that I will lose dictatorial power and gain some bureaucratic / discussion overhead
by moving from sole developer to an Apache model. The benefits of a community are well worth
it. 

If we as a community decide that contrib is the right place for Graphulo, then there are lots
of logistical questions to decide like where the code will live, where JIRA will live, what
mailing lists to use, what URL to give Graphulo within apache.org, etc. We can tackle these
at our leisure. Let's discuss Graphulo and Accumulo here first. 

Warmly, 
Dylan Hutchison 

</blockquote>



