mahout-user mailing list archives

From Sebastian Schelter <>
Subject Re: Mahout on Spark?
Date Wed, 19 Feb 2014 12:57:23 GMT
Completely agree with Sean's statement.

On 02/19/2014 01:52 PM, Sean Owen wrote:
> To set expectations appropriately, I think it's important to point out
> this is completely infeasible short of a total rewrite, and I can't
> imagine that will happen. It may not be obvious if you haven't looked
> at the code how completely dependent on M/R it is.
> You can swap out M/R and Spark if you write in terms of something like
> Crunch, but that is not at all the case here.
> On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas <> wrote:
>> +100 for this: different execution engines, like the direction Pig and Crunch take
>> Sent from my iPhone
>>> On Feb 19, 2014, at 5:19 AM, Gokhan Capan <> wrote:
>>> I imagine Mahout offering users an option to select from
>>> different execution engines (just like we currently do by giving M/R or
>>> sequential options), starting with Spark. I am not sure what changes
>>> would be needed in the codebase, though. Maybe following MLI (or the like) and
>>> implementing some more stuff, such as common interfaces for iterating over
>>> data (the M/R way and the Spark way).
>>> IMO, another effort might be porting pre-online machine learning (such
>>> as transforming text into vectors based on the dictionary previously generated by
>>> seq2sparse), machine learning based on mini-batches, and the streaming
>>> summarization stuff in Mahout to Spark Streaming.
>>> Best,
>>> Gokhan
>>> On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov <> wrote:
>>>> PS I am moving along on a cost optimizer for Spark-backed DRMs on some
>>>> multiplicative pipelines that is capable of figuring out different cost-based
>>>> rewrites, and an R-like DSL that mixes in-core and distributed matrix
>>>> representations and blocks, but it is painfully slow; I am really only doing
>>>> like a couple of nights a month. It does not look like I will be doing it on
>>>> company time any time soon (and even if I did, the company doesn't seem to
>>>> be inclined to contribute anything new I do on their time). It is
>>>> all painfully slow; there's no direct funding for it anywhere with no
>>>> strings attached. That will probably be the primary reason why Mahout would not
>>>> be able to get much traction compared to university-based contributions.
>>>> On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov <> wrote:
>>>>> Unfortunately, methinks the prospects of something like a Mahout/MLlib merge
>>>>> seem very unlikely due to vastly diverged approaches to the basics of linear
>>>>> algebra (and other things). Just like one cannot grow a single tree out of
>>>>> two trunks -- not easily, anyway.
>>>>> It is fairly easy to port (and subsequently beat) MLlib at this point from a
>>>>> collection-of-algorithms point of view. But IMO the goal should be to be more
>>>>> MLI-like first, and port second. And be very careful with concepts.
>>>>> Something that I so far don't see happening with MLlib. MLlib seems to be in an
>>>>> old-style Mahout-like rush to become a collection of basic algorithms
>>>>> rather than a coherent foundation. Admittedly, I haven't looked very closely.
>>>>> On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter <> wrote:
>>>>>> I'm also convinced that Spark is a superior platform for executing
>>>>>> distributed ML algorithms. We've had a discussion about a change from
>>>>>> Hadoop to another platform some time ago, but at that point in time it was
>>>>>> not clear which of the upcoming dataflow processing systems (Spark,
>>>>>> Hyracks, Stratosphere) would establish itself amongst the users. To me it
>>>>>> seems pretty obvious that Spark won the race.
>>>>>> I concur with Ted, it would be great to have the communities work
>>>>>> together. I know that at least 4 Mahout committers (including me) are
>>>>>> already following Spark's mailing list and actively participating in the
>>>>>> discussions.
>>>>>> What are the ideas for how a fruitful cooperation could look?
>>>>>> Best,
>>>>>> Sebastian
>>>>>> PS:
>>>>>> I ported LLR-based cooccurrence analysis (aka item-based recommendation)
>>>>>> to Spark some time ago, but I haven't had time to test my code on a large
>>>>>> dataset yet. I'd be happy to see someone help with that.
>>>>>>> On 02/19/2014 08:04 AM, Nick Pentreath wrote:
>>>>>>> I know the Spark/MLlib devs can occasionally be quite set in their ways of
>>>>>>> doing certain things, but we'd welcome as many Mahout devs as possible to
>>>>>>> work together.
>>>>>>> It may be too late, but perhaps a GSoC project to look at a port of some
>>>>>>> stuff like the co-occurrence recommender and streaming k-means?
>>>>>>> N
>>>>>>> --
>>>>>>> Sent from Mailbox for iPhone
>>>>>>> On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning <> wrote:
>>>>>>>> On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath <> wrote:
>>>>>>>>> My (admittedly heavily biased) view is Spark is a superior platform
>>>>>>>>> overall for ML. If the two communities can work together to leverage the
>>>>>>>>> strengths of Spark, and the large amount of good stuff in Mahout (as well
>>>>>>>>> as the fantastic depth of experience of Mahout devs), I think a lot can be
>>>>>>>>> achieved!
>>>>>>>> It makes a lot of sense that Spark would be better than Hadoop for ML
>>>>>>>> purposes given that Hadoop was intended to do web-crawl kinds of things
>>>>>>>> and Spark was intentionally built to support machine learning.
>>>>>>>> Given that Spark has been announced by a majority of the
>>>>>>>> distribution vendors, it makes sense that maybe Mahout should jump in.
>>>>>>>> I really would prefer it if the two communities (MLlib/MLI and Mahout)
>>>>>>>> could work more closely together. There is a lot of good to be had on both
>>>>>>>> sides.
