mahout-dev mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Board Report
Date Sun, 06 Apr 2014 10:41:40 GMT
Hi Sean,

Answers inline.

On 04/06/2014 11:35 AM, Sean Owen wrote:
> I agree it's worth pausing to ask what is going on. Recent tweets and
> articles I've seen give the impression that the project is somehow
> moving entirely to Spark (or even Stratosphere?), or, entirely to H2O.
> These are sweeping changes that sound very hard to reconcile.

What is going on is the process of finding the next direction for
Mahout. This process started only recently, is still ongoing, and
involves talking to people and projects outside of Mahout to find areas
where collaboration might be beneficial. Apache projects ought to be
community driven, and the recent tweets and articles are meant to draw
attention and feedback from the community on the proposed changes, so
that we can validate whether we are heading in the right direction.

Reactions have been quite positive so far: there is interest in
collaboration from the Spark, H2O and Stratosphere communities. And the
room was crowded, with no chairs left, at the Hadoop Summit Europe last
week, when Ted, Suneel and I gave a short talk describing potential
future directions for Mahout and had a lively discussion with the
audience for the rest of the time.

What is to be done now is to go through a process of discussion and 
experimentation.

> The reality seems more like: someone wants to add some Spark-based
> matrix stuff and someone else wants to add some H2O-based matrix
> stuff. These are individually intriguing, and less hard to reconcile,
> although sound overlapping.

I think there is a big misconception here. It is not the case that
"someone wants to add Spark-based matrix stuff". Dmitriy has been
working for several months on a Scala DSL [1] for distributed linear
algebraic operations, which allows one to write algorithms in a concise,
compact and beautiful way. A first prototype of this code is part of the
codebase and looks very promising.

The best aspect of this DSL is that it allows us to define algorithms on
a *logical* level using a set of underlying logical operators. The
benefit is that this abstracts away the underlying execution system.
Dmitriy already provides a prototypical runtime based on Apache Spark.
It should be possible to integrate other systems like Stratosphere [2]
by simply providing an implementation of the operators tailored to
Stratosphere. In this way, users would be given the choice to run our
algorithms on different systems without us having to maintain lots of
different algorithm implementations.
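
To make the idea of logical operators with interchangeable runtimes
concrete, here is a minimal sketch in plain Python. It is not the Mahout
Scala DSL and does not use Spark or Stratosphere. All names (Matrix,
Transpose, Times, Backend, RowLoopBackend, OuterProductBackend, gram)
are hypothetical and exist only for illustration. The algorithm is
written once against the logical operators, and each backend supplies
its own physical implementation of them.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


# Logical plan nodes: they describe *what* to compute, not how.
@dataclass(frozen=True)
class Matrix:
    rows: tuple  # tuple of tuples of floats


@dataclass(frozen=True)
class Transpose:
    child: object


@dataclass(frozen=True)
class Times:
    left: object
    right: object


class Backend(ABC):
    """Hypothetical execution engine; one subclass per target system."""

    @abstractmethod
    def transpose(self, m): ...

    @abstractmethod
    def times(self, a, b): ...

    def run(self, plan):
        # Walk the logical plan and dispatch to the physical operators.
        if isinstance(plan, Matrix):
            return [list(r) for r in plan.rows]
        if isinstance(plan, Transpose):
            return self.transpose(self.run(plan.child))
        if isinstance(plan, Times):
            return self.times(self.run(plan.left), self.run(plan.right))
        raise TypeError(f"unknown logical operator: {plan!r}")


class RowLoopBackend(Backend):
    """Multiplies via row-by-column dot products."""

    def transpose(self, m):
        return [list(col) for col in zip(*m)]

    def times(self, a, b):
        cols = list(zip(*b))
        return [[sum(x * y for x, y in zip(row, col)) for col in cols]
                for row in a]


class OuterProductBackend(Backend):
    """Multiplies by accumulating outer products, a different strategy."""

    def transpose(self, m):
        return [[m[i][j] for i in range(len(m))] for j in range(len(m[0]))]

    def times(self, a, b):
        out = [[0.0] * len(b[0]) for _ in range(len(a))]
        for t in range(len(b)):
            for i in range(len(a)):
                for j in range(len(b[0])):
                    out[i][j] += a[i][t] * b[t][j]
        return out


def gram(a):
    # The "algorithm": A^T A, expressed only in logical operators.
    return Times(Transpose(a), a)


plan = gram(Matrix(((1.0, 2.0), (3.0, 4.0))))
result_a = RowLoopBackend().run(plan)
result_b = OuterProductBackend().run(plan)
```

Both backends evaluate the same plan, so supporting another execution
system in this toy model means adding one more Backend subclass and
leaving gram() untouched.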

> But then, it's not clear what happens to the rest of the code base,
> most of which is not related? Rewriting it seems far out of scope of
> available effort, and not what anyone is suggesting. I assume deleting
> it, while coherent, would be too extreme.

This is a point that needs to be discussed. With the latest release, we 
already deleted over 17,000 lines of code related to rarely used and 
unmaintained algorithms. Whether it is feasible to port the remaining
distributed algorithms to a new platform depends on whether we can 
attract enough new faces to the project. That is one of the reasons why 
we talk to other projects and communities. From my personal experience I 
can say that implementing an algorithm in the new Scala DSL takes only a 
fraction of the time it takes to write it using MapReduce :)

> Speaking as a downstream consumer now, the de facto plan emerging here
> seems to be a plan to worsen, not address, the significant
> inconsistencies and problems in the code already. There would be
> undistributed, MR1, MR2, Spark, H2O code of differing flavors
> scattered around. It sounds like a step away from 1.0-readiness at a
> time when this seems to be advertised as coming soon.
> In the context of a board report, I would think it's also important to
> acknowledge this perspective, as it is almost certainly causing the
> project to be removed from a major ecosystem distributor.

What I see is a lively, community-driven discussion ongoing that has yet 
to produce a de-facto plan. I urge you and the major ecosystem 
distributor to participate in this discussion so that we can together 
produce an outcome that matches our interests.


Best,
Sebastian


[1] https://mahout.apache.org/users/sparkbindings/home.html
[2] http://stratosphere.eu/
