mahout-dev mailing list archives

From Sebastian Schelter <>
Subject Re: 0xdata interested in contributing
Date Fri, 14 Mar 2014 18:01:09 GMT
Praise belongs to Dmitriy. The sparkbindings were his work, not mine.

On 03/14/2014 06:58 PM, Pat Ferrel wrote:
> Not at all. The effect on the community is exactly what I’m most worried about. The
> effect on community will be far worse if we reset based on architectural talk alone. It involves
> not just Mahout’s community but Spark’s and 0xdata’s.
> I think people (including me) have underestimated how much you and Sebastian have done
> on Spark. Realistically it sounds like we are talking about walking away from that in favor
> of an unknown.
> 0xdata’s community has not been solving the problems I care about. You guys have.
> On Mar 14, 2014, at 10:41 AM, Dmitriy Lyubimov <> wrote:
> I think you still miss the point.
> The performance comparison will come down to comparing Mahout's in-core
> algebra with that of something else. Most likely, as it stands, it will
> not be better than 0xdata, but it will not be 100x worse either, and it
> will be entirely a function of the in-core linear algebra in Mahout,
> not that of Spark or any other memory-based engine. Spark, specifically,
> has focused on low startup costs, so it is unlikely anything will beat it
> by a meaningful margin in that department.
> But if we measure the effect on community, which is the stated goal of this
> merger, you can't benchmark it with a computer. I am trying to predict it by
> analyzing other projects that did well in that department, and what made
> them do well. And it was largely not performance -- not a single-digit
> factor anyway.
> On Fri, Mar 14, 2014 at 10:14 AM, Pat Ferrel <> wrote:
>> Isn't there some work on RSJ on Spark? Can we compare that to something
>> 0xdata can "knock off"?
>> On Mar 14, 2014, at 10:08 AM, Sebastian Schelter <> wrote:
>> Dmitriy,
>> I share a lot of the concerns you expressed here. I hear more complaints
>> about Mahout being too inaccessible and too hard to customize for use cases
>> and inputs than complaints about it being too slow. I also concur with
>> your analysis that the clear and accessible programming model is what
>> causes Spark's popularity.
>> I'm also not a fan of sacrificing a programming model for performance; I
>> also consider this the main drawback of Graphlab. It's superfast for a
>> certain set of problems, but it constrains you to a vertex-centric
>> programming model, into which a lot of things hardly fit.
>> On 03/14/2014 03:21 PM, Dmitriy Lyubimov wrote:
>>>> I think that the proposal under discussion involves adding a dependency on
>>>> a maven released h2o artifact plus a contribution of Mahout translation
>>>> layers.  These layers would give a sub-class of Matrix (and Vector) which
>>>> allow direct control over life span across multiple jobs but would
>>>> otherwise behave like their in-memory counter-parts.
>>> Well I suppose that means they have to live in some processes which are not
>>> processes I already have. And they have to be managed. So this is not just
>>> an in-core subsystem. Sounds like a new back to me.
>>>>> In Hadoop, every iteration must be scheduled as a separate job, rereads
>>>>> invariant data and materializes its result to hdfs. Therefore, iterative
>>>>> programs on Hadoop are an order of magnitude slower than on systems that
>>>>> have dedicated support for iterations.
>>>>> Does h2o help here or would we need to incorporate another system for
>>>>> such tasks?
>>>> H2o helps here in a couple of different ways.
>>>> The first and foremost is that primitive operations are easy.
>>>> Additionally, data elements can survive a single program's execution. This
>>>> means that programs can be executed one after another to get composite
>>>> effects.  This is astonishingly fast ... more along the speeds one would
>>>> expect from a single processor program.
>>> I think the problem here is that the authors keep comparing these
>>> techniques to the slowest model available, which is Hadoop.
>>> But this is the exact execution model of Spark. You get stuff repeatedly
>>> executed on in-memory partitions and get approximately the speed of
>>> iterative execution.  I won't describe it as astonishing, though,
>>> because indeed it is as fast as you can get things done in memory, no
>>> faster, no slower. That's, for example, the reason why my linalg optimizer
>>> does not hesitate to compute exact matrix geometry lazily if not known, for
>>> optimization purposes, because the answer will be back within 40 to 200
>>> ms, assuming adequate RAM allocation. I have been using these paradigms for
>>> more than a year now. This is all good stuff. I would not use the word
>>> astonishing, but sensible, yes. The main concern is whether the programming
>>> model is being asked to be sacrificed just to do sensible things here.
>>>>> (2) Efficient join implementations
>>>>> If we look at a lot of Mahout's algorithm implementations with a database
>>>>> hat on, then we see lots of handcoded joins in our codebase, because Hadoop
>>>>> does not bring join primitives. This has lots of drawbacks, e.g. it
>>>>> complicates the codebase and leads to hardcoded join strategies that bake
>>>>> certain assumptions into the code (e.g. ALS uses a broadcast-join which
>>>>> assumes that one side fits into memory on each machine, RecommenderJob uses
>>>>> a repartition-join which is scalable but very slow for small inputs,...).
>>> +1
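
The two join strategies named above can be illustrated with a minimal single-process sketch in plain Python. This is not Mahout, Hadoop, or Spark code; the function names `broadcast_join` and `repartition_join` and the `num_partitions` parameter are hypothetical and chosen only for illustration. The broadcast variant assumes the small side fits in memory as a hash table, while the repartition variant first routes both sides by key hash and then joins each partition independently.

```python
from collections import defaultdict


def broadcast_join(large, small):
    """Join (key, value) pairs by hashing the small side and streaming the large side.

    Assumption for illustration: `small` fits entirely in memory.
    """
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # Each large-side record probes the in-memory table.
    return [(key, lv, sv) for key, lv in large for sv in lookup.get(key, [])]


def repartition_join(left, right, num_partitions=4):
    """Join by routing both sides to partitions by key hash, then joining per partition.

    In a distributed setting the routing step would be a network shuffle;
    here it is simulated with in-memory lists.
    """
    parts = [([], []) for _ in range(num_partitions)]
    for key, value in left:
        parts[hash(key) % num_partitions][0].append((key, value))
    for key, value in right:
        parts[hash(key) % num_partitions][1].append((key, value))
    out = []
    for left_part, right_part in parts:
        # Matching keys always land in the same partition, so a local join suffices.
        out.extend(broadcast_join(left_part, right_part))
    return out
```

Both functions return the same set of joined records; they differ only in where the cost goes (memory for the in-memory table versus a shuffle of both inputs), which is the trade-off the hardcoded strategies bake in.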
>>>> I think that h2o provides this but do not know in detail how.  I do know
>>>> that many of the algorithms already coded make use of matrix multiplication
>>>> which is essentially a join operation.
>>> Essentially a join? The spark module optimizer picks from at least 3
>>> implementations: zip+combine, block-wise cartesian and finally, yes,
>>> join+combine. It depends on orientation and the earlier operators in the
>>> pipeline. That's exactly my point about flexibility of the programming model
>>> from the optimizer's point of view.
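
To make the "matrix multiplication is essentially a join" remark concrete, here is a minimal sketch of the join+combine formulation only, in plain Python. It is not the Spark module optimizer's implementation nor h2o's; the function name `matmul_as_join` and the (row, col, value) triple representation are assumptions made for illustration. The column index of A is joined with the row index of B, and the products are then combined by summing per output cell.

```python
from collections import defaultdict


def matmul_as_join(a_entries, b_entries):
    """Multiply two sparse matrices given as (row, col, value) triples.

    Join step: match A's column index k with B's row index k.
    Combine step: sum the products that fall into the same output cell (i, j).
    """
    # Index B by row so that each A entry can find its join partners.
    b_by_row = defaultdict(list)
    for k, j, w in b_entries:
        b_by_row[k].append((j, w))

    cells = defaultdict(float)
    for i, k, v in a_entries:
        for j, w in b_by_row.get(k, []):
            cells[(i, j)] += v * w
    return dict(cells)


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], stored without zero entries.
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
product = matmul_as_join(A, B)
```

The zip+combine and block-wise cartesian alternatives mentioned in the thread are different physical plans for the same product and are not shown here.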
>>>>> Obviously, I'd love to get rid of handcoded joins and implement ML
>>>>> algorithms (which is hard enough on its own). Other systems help with this
>>>>> already. Spark, for example, offers broadcast and repartition-join
>>>>> primitives, Stratosphere has a join primitive and an optimizer that
>>>>> automatically decides which join strategy to use, as well as a highly
>>>>> optimized hybrid hashjoin implementation that can gracefully go out-of-core
>>>>> under memory pressure.
>>>> When you get into the realm of things on this level of sophistication, I
>>>> think that you have found the boundary where alternative foundations like
>>>> Spark and Stratosphere are better than h2o.  The novelty with h2o is the
>>>> hypothesis that a very large fraction of interesting ML algorithms can be
>>>> implemented without this power.  So far, this seems correct.
>>> Again, this is largely along the lines of "let's make a library of a few
>>> hand-optimized things". Which is noble, but -- I would argue -- not
>>> ambitious enough. Most of the distributed ML projects do just that. We
>>> should perhaps think about what could be a differentiating factor
>>> for us.
>>> Not that we should not care about performance. It should be, of course,
>>> *sensible*. (Our MR code base of course does not give us that; as you said,
>>> jumping off the MR wagon is not even a question).
>>> If you can forgive me for drawing parallels here, it's the difference between
>>> something like Weka and R. Collection vs. platform _and_ collection induced
>>> by platform. Platform of course also positively feeds into the speed of
>>> collection growth directly.
>>> When I use R, my code does not consist of algorithm calls. That is,
>>> yes, it makes off-the-shelf use now and then, but that is far from being
>>> the only thing it is doing. 95% of it is simple feature
>>> massaging. I place no value in R for providing GLM for me. Gosh, this
>>> particular offering is available virtually anywhere these days.
>>> But I do place value in it for doing custom feature prep and, for
>>> example, being able to get 100 grad students to try their own k-means
>>> implementation in seconds.
>>> Why?
>>> There has been a lot of talk here about building community and
>>> contributions etc. Platform is what builds it, most directly and amazingly.
>>> I would go out on a limb here and say that Spark and MLlib are experiencing
>>> explosive growth of contributions not because they can do things with
>>> in-memory datasets (which is important but, like I said, has long
>>> been viewed as no more than just sensible), but because of the clarity of the
>>> programming model. I think we have seen very solid evidence that clarity
>>> and richness of the programming model is the thing that attracts communities.
>>> If we grade roughly (very roughly!) what we have today, I can easily argue
>>> that the acceptance levels follow the programming model very closely. E.g.
>>> if I try to sort projects with distributed programming models by (my
>>> subjectively perceived) popularity, from bottom to top:
>>> ********
>>> Hadoop MapReduce -- ok, I don't even know how to organize the critique here,
>>> too long of a list; almost nobody (but Mahout) does these things this way
>>> today. Certainly, none of my last 2 employers did.
>>> Hive -- SQL-like with severely constrained general programming language
>>> capabilities, not conducive to batches. Pretty much limited to ad-hoc
>>> exploration.
>>> Pig -- a bit better, can write batches, but extra functionality mixins
>>> (UDFs) are still a royal pain.
>>> Cascading -- even easier, rich primitives, easy batches, some manual
>>> optimization of physical plan elements. One of the big cons is the
>>> limitation of a rigid dataset tuple structure.
>>> FlumeJava (Crunch in the Apache world) -- even better, but Java closures are
>>> just plain ugly, zero "scriptability". Its community has been hurt a little
>>> bit because of the fact that it was a bit late to the show compared to
>>> others (e.g. Cascading), but it leveled off quickly.
>>> Scala bindings for Cascading (Scalding) and FlumeJava -- better, hell, well
>>> better on the closure and FP front! But still, not being native to Scala
>>> from the get-go creates some miniature problems there.
>>> Spark -- I think it is fair to say the current community "king" above all
>>> of those -- all the aforementioned platform model pains are eliminated, although
>>> on the performance side I think there are still some pockets for improvement on
>>> the cost-based optimization side of things.
>>> Stratosphere might be more interesting in this department, but I am not
>>> sure at this point if that necessarily will translate into performance
>>> benefits for ML.
>>> ********
>>> The first few things are using the same computing model underneath and
>>> essentially are having roughly the same performance. Yet there's clear
>>> variation in community and acceptance.
>>> In the ML world, we are seeing approximately the same thing. The clearer the
>>> programming model and the easier the integration into the process, the wider the
>>> acceptance. I can probably argue pretty successfully that the currently most
>>> performant ML "thing" as it stands is GraphLab. And it is pretty
>>> comprehensive in problem coverage (I think it covers e.g. recommender
>>> concerns better than h2o and Mahout together, for example). But I can also
>>> argue pretty successfully that it is being rejected a lot of the time for being just
>>> a collection (which, in addition, is hard to call from the JVM, i.e.
>>> integration again). It is actually so bad that people in my company would
>>> rather go back to 20 snow-wired R servers than think of even entertaining
>>> an architecture including a GraphLab component. (Yes, the variance of this sample
>>> is as high as it gets, just saying what I hear).
>>> So as a general guideline to solve the current ills, it would stand to
>>> reason to adopt platform priority and algorithm collection as a function of
>>> such platform, rather than collection as a function of few dedicated
>>> efforts. Yes -- it has to be *sensibly* performant -- but this does not
>>> have to be mostly a concern of the code in this project directly. Rather,
>>> it has to be a concern of the backs (i.e. dependencies) and our in-core
>>> support.
>>> Our pathological fear of being a performance scapegoat totally obscures the
>>> fact that performance is mostly a function of the back and that we were
>>> riding on the wrong back for a long time. As long as we don't cling to a
>>> particular back, it shouldn't be a problem. What would one rather accept:
>>> being initially 5x slower than Graphlab (but on par with MLlib) but beating
>>> these on community support, or being on par but anemic in community? If
>>> 02 platform feels the performance has been so important as to sacrifice
>>> the programming model, why do they feel the need to join an Apache project? After
>>> all, they have been an open project for a long time already and have built
>>> their own community, big or small. Spark has just now become a top-level
>>> Apache project, and joined the Apache incubator a mere 2 months ago and did not
>>> have any trouble attracting community outside Apache at all. Stratosphere
>>> is not even in Apache. Similarly, did it help Mahout to be in Apache to get
>>> anywhere close in community measurement to these? So this totally refutes
>>> the argument that one has to be an Apache project to get its exclusive qualities
>>> highlighted. Perhaps in the end it is more about the importance of the
>>> qualities to the community and the quality of contributions.
>>> A lot of this platform and programming model priority is probably easier to
>>> say than do, but some of the linalg and data frame things are ridiculously easy
>>> in terms of amount of effort. If I could do a linalg optimizer with
>>> bindings for Spark with 2 nights a month, the same can be done for
>>> multiple backs and data frames in a jiffy. Well, the back should have a
>>> clear programming model of course as a prerequisite. Which brings us back
>>> to the issue of richness of distributed primitives.
