mahout-dev mailing list archives

From Andrew Musselman <>
Subject Re: Mahout 1.0 goals
Date Fri, 28 Feb 2014 03:13:24 GMT
I agree with b) and c); haven't used seq2sparse enough to grok a).

On Thu, Feb 27, 2014 at 6:30 PM, Suneel Marthi <> wrote:

> With yesterday's announcement of various Neural Network implementations
> on Hadoop 2/JBlas, which had been talked about in one of the other
> discussion threads on this mailing list, do we wanna duplicate a similar
> effort in Mahout?
> In addition to what Dmitriy's already outlined below, I may add that one
> of the bottlenecks (in my experience) in Mahout's processing pipeline is
> 'seq2sparse'.
> a) Optimize seq2sparse to handle incremental dictionary tokens
>      - Support for Deterministic Finite Automaton to speed up text
>        processing.
>      - Not using StringTuples so much in the tokenization (may result in
>        some speedup).
>      - Explore using Lucene 4.7 in-memory term dictionaries; this may
>        improve the performance substantially.
>      Even better, why not use Lucene indices themselves as document
>      repositories, as opposed to what's being done now?
> b) Stabilize the existing Clustering algorithms - except for Simple KMeans
> the others have issues once we deviate from the 'Happy Sunday Path'
> implementation and lack adequate test coverage.
> c) RESTful interfaces for invoking classifiers/clustering.
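
A minimal sketch of what "incremental dictionary tokens" in a) could mean: a dictionary that hands out a new term id only when it sees an unseen token, so ids assigned to earlier documents never change when later batches arrive. This is plain Python for illustration only, not Mahout's seq2sparse; the class name `IncrementalDictionary` and its method are hypothetical.

```python
class IncrementalDictionary:
    """Hypothetical helper: maps tokens to stable integer ids."""

    def __init__(self):
        self.token_to_id = {}

    def add_document(self, tokens):
        """Return a sparse term-frequency vector {term_id: count}."""
        vector = {}
        for token in tokens:
            # An unseen token gets the next free id; known tokens keep theirs.
            term_id = self.token_to_id.setdefault(token, len(self.token_to_id))
            vector[term_id] = vector.get(term_id, 0) + 1
        return vector


dictionary = IncrementalDictionary()
first = dictionary.add_document("the quick brown fox".split())
second = dictionary.add_document("the lazy dog".split())
```

Because the second document only appends ids for "lazy" and "dog", the vector built for the first document stays valid without re-running the dictionary pass over the whole corpus.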
> On Thursday, February 27, 2014 9:10 PM, Dmitriy Lyubimov <> wrote:
> If we approach this from a purely "marketing" standpoint, i would look at it
> from two points: why is Mahout used, and why it is not used.
> Mahout is not used because it is a collection of methods that are fairly
> non-uniform in their api, especially embedded api, and generally offers zero
> encouragement to be built on top of and incorporated into yet larger
> customizable models. I.e. it lacks the semantic explicitness needed for
> quick prototyping, and stitching things together is next to impossible.
> Yet Mahout is used in spite of the above because it has some pretty unique
> solvers in the area of linear algebra and text topical analysis. But I
> would dare to say not e.g. because of GLM regressions.
> I personally also use Mahout e.g. in favor of something like breeze because
> it has sparse linalg support, both in-core and out-of-core, from the very
> beginning and it fits naturally unlike in any other package i ever looked
> at, R including btw.
> But i find myself heavily disassembling Mahout into parts and bolts rather
> than using it exactly how e.g. MIA prescribes.
> Bottom line here, preliminarily primary issues are ease of use,
> embedment/scripting, ease of customization, uniformity of apis.
> (1) Take the semantic explicitness and scripting issue. Well i guess that's
> where the R part comes from, not because we just want to run R. I would
> clear it right away -- i don't support any sort of R integration. And not
> really because of lack of trying -- I have created a few R front ends for a
> bunch of distributed applications, and also created projects that run R in
> the backend (I wrote CrunchR more than a year ago, which is the same thing
> for Crunch as what SparkR is for Spark; and yet another MR framework running
> R in the backend; and also tried to run things with HadoopR). And i have
> developed a pretty strong opinion that R just doesn't mix with distributed
> frameworks, mostly because of the performance penalties (if you lose
> $5 per day in performance on a single machine it may be ok, but on 100
> machines one loses $500 a day -- and mid-size companies in my experience
> are not susceptible to the "let's solve it at any HW cost" doctrine, much as
> it is generally believed to be the other way around).
> Anyway, on the R topic i don't see it as a solution for any sort of
> semantically explicit driver and customizer technology. There's neither
> demand nor willingness of corporate bosses to go that route. I grew pretty
> opinionated on that issue.
> But you don't need R to address semantical explicitness, customization and
> ease of integration/scripting. Pragmatically, i see scala and a carefully
> crafted scala dsl as the underlying mechanism for achieving this. Also,
> internally i use scala scripting a lot and it is really easy to build a shell
> interpreter for it (just like spark builds a customized shell), so one
> doesn't even necessarily need to compile these things.
> Bottom line, ideally distributed solver implementation should look more
> like matlab than java. And I would measure that goal along the lines of
> Evan Sparks' talks (i.e. in lines of code and explicitness needed to script
> out a well known method).
> See, you forced my hand to discuss solutions ("how")  :)
> (2) on the issue of minimally supported algorithms. Again, i would not see
> MLlib as a prototype there. Given enough semantical explicitness, virtually
> any data scientist would script out ALS in their sleep. And every second
> one would script out weighted ALS (so-called "implicit feedback"). I view
> those algorithms not as a goal but rather as a guinea pig for validating
> the semantical value of an ML environment and its apis. I would port
> stronger solvers into the new semantic ML environment over Spark rather than
> trying to cover the very "basics".
> Pragmatically, i would say it would be interesting and useful (for me)
> to have LDA/LSA/sparse PCA solvers ported. I would also port all clustering
> we have (albeit maybe not exactly following the methodology).
> I would also be interested in laying a foundation for customized hierarchical
> solutions along the lines of RLFM with various customizations, including in
> particular temporal weighting of inference and customized inference of
> informative priors there. Computational Bayesian methods along the lines of
> MCEM and MCMC are said to provide very accurate solutions here. The latter
> class of models IMO is much more interesting for practitioners of
> recommendations than the pure, rigid, uncustomizable ALS class of models,
> weighted or not. At least Deepak Agarwal sounds very convincing in his talks.
> (3) on the issue of performance, i guess by using the Spark bindings dsl you
> can't do any worse than MLlib. Perhaps we could also include support for
> dense JBlas matrices under the hood of the Matrix API if there is interest.
> Also i am hearing that using GPU libraries is lately becoming very popular
> for performance reasons; up to 300x lin alg speedups are reported. There are
> some fancy thoughts about cost-based optimization of algebraic expressions
> for distributed pipelines, but for a start I will do just very
> simple physical plan substitutions (something like: if i directly see A'A as
> a part of an expression, or if the A'B' product has small geometry, then of
> course i'd rather do (BA)', etc.).
> But it has the potential to do more while retaining an absolute degree of
> manually forced execution (thru forced checkpoints). It's just that i would
> stop at what i pragmatically need to script out distributed SSVD at this point.
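
A minimal sketch of the kind of physical plan substitution mentioned in (3): rewriting A'B' into (BA)', which yields the same matrix with one transposition instead of two. It is plain Python over nested lists, not the Spark bindings dsl. The tuple-based expression encoding and the helper names (`rewrite`, `evaluate`) are assumptions made for illustration. The rule is also applied unconditionally here, whereas the message above only suggests it when the product has small geometry.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def rewrite(expr):
    """Rewrite A'B' into (BA)'; leave every other expression unchanged."""
    if isinstance(expr, tuple) and expr[0] == "mul":
        left, right = rewrite(expr[1]), rewrite(expr[2])
        both_transposed = (
            isinstance(left, tuple) and left[0] == "t"
            and isinstance(right, tuple) and right[0] == "t"
        )
        if both_transposed:
            # A'B' == (BA)': swap the operands, transpose once at the end.
            return ("t", ("mul", right[1], left[1]))
        return ("mul", left, right)
    if isinstance(expr, tuple) and expr[0] == "t":
        return ("t", rewrite(expr[1]))
    return expr  # a leaf: the name of a matrix


def evaluate(expr, env):
    if isinstance(expr, str):
        return env[expr]
    if expr[0] == "t":
        return transpose(evaluate(expr[1], env))
    return matmul(evaluate(expr[1], env), evaluate(expr[2], env))


env = {"A": [[1, 2], [3, 4], [5, 6]], "B": [[1, 0, 2], [0, 1, 3]]}
plan = ("mul", ("t", "A"), ("t", "B"))
optimized = rewrite(plan)
```

Evaluating `plan` and `optimized` against `env` gives the same 2x2 result, so the substitution changes only how the product is computed, not what it computes.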
> (4) but in general i would say the scope of your issues sounds like
> something that would close a gap between 0.5 and 1.0 rather than 0.9 and
> 1.0.
> -d
> On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <> wrote:
> > I would like to start a conversation about where we want Mahout to be for
> > 1.0.  Let's suspend for the moment the question of how to achieve the
> > goals.  Instead, let's converge on what we really would like to have
> > happen and after that, let's talk about means that will get us there.
> >
> > Here are some goals that I think would be good in the area of numerics,
> > classifiers and clustering:
> >
> > - runs with or without Hadoop
> >
> > - runs with or without map-reduce
> >
> > - includes (at least) regularized generalized linear models, k-means,
> > random forest, distributed random forest, distributed neural networks
> >
> > - reasonably competitive speed against other implementations including
> > graphlab, MLlib and R.
> >
> > - interactive model building
> >
> > - models can be exported as code or data
> >
> > - simple programming model
> >
> > - programmable via Java or R
> >
> > - runs clustered or not
> >
> >
> > What does everybody think?
> >
