mahout-dev mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: Board Report
Date Mon, 07 Apr 2014 18:08:47 GMT
Right! Even more black box is needed. Mahout is not for scientists; it is for app devs, some
of whom are trying desperately to learn the math, not for people coming from R and yearning to
do scalable app dev.

I guess I’ve already said that the Mahout reboot should be first on Spark and leave the
other engines to work out their own integrations. 

On Apr 7, 2014, at 10:45 AM, Sebastian Schelter <ssc@apache.org> wrote:

A few questions here.

@Dmitriy I very much share your vision. I just think that our targeted userbase is more than
people wanting to implement their own algorithms using high level constructs in a modern language
like Scala. 

I think there is still a huge demand for a blackbox of algorithms that lets users easily build
a model without having to know too much about the underlying math. Our recommenders are a
good example. Provide data in a simple CSV format, throw ItemSimilarityJob on that and use
the item similarities for recommendations.
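
To make the black-box workflow above concrete, here is a minimal sketch in plain Python. It is not Mahout's ItemSimilarityJob and does not use any Mahout API; the `user,item,rating` CSV layout, the choice of cosine similarity, and the function names `item_similarities` and `recommend` are all assumptions made for illustration.

```python
import csv
import io
import math
from collections import defaultdict


def item_similarities(csv_text):
    """Cosine similarity between item rating vectors.

    Assumes each CSV row is 'user,item,rating' (a hypothetical layout).
    Returns {(item_a, item_b): similarity} with item_a < item_b.
    """
    vectors = defaultdict(dict)  # item -> {user: rating}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        user, item, rating = row
        vectors[item][user] = float(rating)

    norms = {
        item: math.sqrt(sum(r * r for r in ratings.values()))
        for item, ratings in vectors.items()
    }

    sims = {}
    items = sorted(vectors)
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            shared = vectors[a].keys() & vectors[b].keys()
            if not shared or norms[a] == 0 or norms[b] == 0:
                continue  # no co-rating users, or an all-zero vector
            dot = sum(vectors[a][u] * vectors[b][u] for u in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


def recommend(csv_text, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted ratings."""
    sims = item_similarities(csv_text)
    seen = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0] == user:
            seen[row[1]] = float(row[2])

    scores = defaultdict(float)
    for (a, b), sim in sims.items():
        if a in seen and b not in seen:
            scores[b] += sim * seen[a]
        elif b in seen and a not in seen:
            scores[a] += sim * seen[b]
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The caller only supplies a CSV and a user id and never touches the similarity math, which is the sense of "blackbox" used here. A real job would distribute the pairwise step rather than loop over all item pairs in memory.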

Do you suggest we should leave the blackbox stuff to MLBase/Oryx and solely focus on providing
high level ML constructs?

@Sean How much can you agree with the vision I suggested? It meets your demand of having a plan
to solve the problems with the MR codebase (by getting rid of it in the near future) and provides
a direction for Spark as the new underlying execution system, with optional support for Stratosphere
and H2O, if those communities manage to convince us that it is worth integrating.

--sebastian




2014-04-07 19:29 GMT+02:00 Pat Ferrel <pat@occamsmachete.com>:
The document does not mention the state of the existing Spark work in the snapshot codebase.
Shouldn’t this be noted?

On Apr 7, 2014, at 5:06 AM, Sebastian Schelter <ssc@apache.org> wrote:

I think we should mention the redesign/rework of the website and the completion of the move
from the old wiki to Apache CMS.

--sebastian

On 04/07/2014 02:04 PM, Grant Ingersoll wrote:
> Here is my proposed report.  For the most part, I think the only right thing to do vis-a-vis
the Board is to report that we are in the midst of a healthy (yes, I believe it is, for the
most part healthy and normal) discussion on where to go next.
>
> PMC Members: this is checked into SVN at https://svn.apache.org/repos/asf/mahout/pmc/board-reports/2014/board-report-apr.txt.
 It is due on Wednesday.  If you object to this approach of reporting, please let me know
ASAP and suggest alternatives.
>
> === Apache Mahout Status Report: April 2014 ===
>
> -----
>
> Apache Mahout has implementations of a wide range of machine learning and
> data mining algorithms: clustering, classification, collaborative filtering
> and frequent pattern mining.
>
> Project Status
> --------------
>
> The project continues to have a large and active user base.  While
> the developer base has continued to grow, there is a very active
> and healthy debate going on about where Mahout goes next.  Please
> see the Issues section below for more details.
>
> Community
> ---------
>
> * Andrew Musselman was voted in as new committer.
> * No changes to the PMC in the reporting period.
>
> * The main issue concerning the community right now is the addition
> of new contributions from 0xData and the integration of Mahout with Spark.
>
> Community Objectives
> --------------------
>
> Our goal is to build scalable machine learning libraries. See the Issues
> section below for the debate in the community about our objectives.
>
>
> Releases
> --------
>
> In addition to an ongoing debate on Mahout's future, the community is actively
>  working on integrating Mahout with Scala/Spark, updating
> documentation, and bringing in new code and committers to update the core project.
>
>
> Issues
> ------
> The Mahout community is at a crossroads in terms of where
> to go next.  While the project has a broad number of users and interested
> parties, most committers are trying to maintain the code base on a purely
> part time basis, when the amount of work needed to sustain these users
> clearly calls for full-time effort.  Furthermore, much of our original
> code base is written
> for Hadoop MapReduce 1.0, which many in the community have come to realize
> is not well-suited for solving the kinds of problems that Mahout has set
> out to solve.  There have been several lengthy discussions and prototypes
> going on to work out next directions along the lines of the Spark and
> 0xData contributions (there are numerous threads on the dev@mahout.a.o
> mailing list.)
>
> The PMC does not think this requires Board intervention at this time
> as the debate is, as far as we can tell, healthy.  We do, however,
> expect that this debate will take some time to resolve and may mean we
> won't be shipping a 1.0 release any time soon.  We will keep the Board
> apprised of our next steps as we work through the process.
>
>
>
>
> On Apr 7, 2014, at 4:53 AM, Grant Ingersoll <gsingers@apache.org> wrote:
>
>> To Sean's point, if Mahout were "my company", I would do the following pragmatic, if
not so pleasant, thing, assuming, of course, I had the $$$ to do so:
>>
>> 1. Clean up existing code with a laser focus on a few key areas (Sebastian's list
makes sense) using a part of the team and call it 1.0 and ship it, as it has a number of users
and they deserve to not have the rug pulled out from under them.
>>
>> 2. Spin out a subset of the team to explore and prototype 2.0 based on two very positive
and re-energizing looking ideas:
>>      a. Scala DSL (and maybe Spark)
>>      b. 0xData
>>
>>      All of the work for #2 would be done in a clean repo and would only bring in
legacy code where it was truly beneficial (back compat. can come later, if at all).
>>      It would then benchmark those two approaches as well as look at where they overlap
and are mutually beneficial and then go forward with the winner.
>>
>> 3. Once #2 is viable, put most effort into it and maintain 1.0 with as minimal support
as possible, encouraging, nay -- actively helping -- 1.0 customers upgrade as quickly as possible.
>>
>> The tricky part then becomes how do you make sure to still make your sales #'s while
also convincing them that your roadmap is what they are really buying.
>>
>> If I didn't have the $$$ to do both of these (i.e. we need a massive turn around
and we have one last shot), I would be all in on #2.
>>
>> -----------------------------------
>>
>> That being said, Mahout is not "my company".  Heck, Mahout is not even a "company",
so we don't need to be bound by company conventions and thought processes, even if that fits
with all of our individual day jobs.  And, thankfully, we don't have any sales numbers to
make.
>>
>> We are chartered with one and only one mission: produce open source, scalable machine
learning libraries under the Apache license and community driven principles.  We are not required
by the Board or anyone else to support version X for Y years or to use Hadoop or Scala or
Java.  We are also not required to implement any specific algorithms or deliver them on specific
time frames.  We are also not required to provide users upgrade paths or the like.  Naturally,
we _want_ to do these things for the sake of the community, but let's be clear: it is not
a requirement from the ASF.  We are, however, required to have a sustaining community.
>>
>> ------------------------------------
>>
>> I personally think we should start clean on #2, throwing off the shackles of the
past and emerge 6-9 months later with Mahout 2.0 (and yes, call it that, not 0.1 as Sebastian
suggests, for marketing reasons) built on a completely new and fresh repository, likely bringing
in only the Math/collections underpinnings and maybe the build system.  This new repository
would have only a handful of core algorithms that we know are well implemented, sustainable
and best in class.
>>
>> I think we should look at the lead up to 0.9 as an experiment that proved out a lot
of interesting ideas, including the fact that Mahout proved there is vast interest in open
source large scale machine learning and that it is the benchmark for comparison.  Not many
other ML projects can say that, even if they have better technical implementations or are
less fragmented.  Once you realize something has outlived its usefulness in software, however,
there is no point in lingering.
>>
>> That being said, at least for the foreseeable future, I am not in a position to contribute
much code.  So, from my perspective, the ASF Meritocratic approach takes over:  those who
do the work make the decisions.  If you want something in, then put up the patch and ask for
feedback.  If no one provides feedback, assume lazy consensus and move forward.  Nothing convinces
people better than actual, real, executing code.  For my part, I am happy to continue to work
the bureaucratic side of things to make sure reports get filed, credentials get created, etc.,
and to contribute the occasional patch.  I hope one day I will have time to contribute again.
>>
>> I will follow up w/ a separate email on what I am going to put in the Board Report.
>>
>> On Apr 7, 2014, at 1:52 AM, Sean Owen <srowen@gmail.com> wrote:
>>
>>> No, it's about the opposite. I'm referring to the default, current
>>> state of play here.
>>>
>>> The issues for a vendor are demand and supportability. Do people want
>>> to pay for support of X? Can you honestly say you have expertise to
>>> support and influence X over at least a major release cycle (12-18
>>> months)? The latter needs a reasonably reliable roadmap and
>>> continuity.
>>>
>>> I'm suggesting that in the current state, demand is low and going
>>> down. The current code base seems de facto deprecated/unsupported
>>> already, and possibly to be removed or dramatically changed into
>>> something as-yet unclear. Nobody here seems to have taken a hard
>>> decision regarding a next major release, but, the trajectory of that
>>> decision seems clear if the current state remains the same.
>>>
>>> From my perspective, "middle-ground" new directions like adding a bit
>>> of H2O, a bit of Spark, leaving bits of M/R code around, etc. are only
>>> worse. I can see why there may be a little renewed demand for the new
>>> bits, but then, why not go all in on one of them?
>>>
>>> Because a substantially all-new direction is a different story. If a
>>> "Mahout2O" or "Spahout" ("Mark"?) emerges as a plan, I could imagine a
>>> lot of renewed demand. And a clearer underlying roadmap sounds
>>> possible. It would remain to be seen, but there's nothing stopping
>>> those ideas from becoming part of a distro too.
>>>
>>>
>>> On Mon, Apr 7, 2014 at 6:22 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>> Please be explicit here.  It sounds like you are saying that if Mahout goes
>>>> in the proposed new direction that Cloudera will drop Mahout.
>>>>
>>>> Is that what you mean to say?
>>
>>
>
> --------------------------------------------
> Grant Ingersoll | @gsingers
> http://www.lucidworks.com
>
>
>
>
>




