mahout-dev mailing list archives

From Dmitriy Lyubimov <>
Subject Re: Mahout 1.0 goals
Date Fri, 28 Feb 2014 02:08:08 GMT
If we approach this from a purely "marketing" standpoint, I would look at it
from two points: why Mahout is used, and why it is not used.

Mahout is not used because it is a collection of methods that are fairly
non-uniform in their API, especially the embedded API, and it generally
offers zero encouragement to build on top of it or to incorporate it into
larger customizable models. I.e. it lacks the semantic explicitness needed
for quick prototyping, and stitching things together is next to impossible.

Yet Mahout is used in spite of the above because it has some pretty unique
solvers in the area of linear algebra and text topic analysis. But I
would dare say it is not used because of, e.g., GLM regressions.

I personally also use Mahout in preference to something like Breeze because
it has had sparse linalg support, both in-core and out-of-core, from the very
beginning, and it fits naturally, unlike in any other package I have ever
looked at, R included btw.

But I find myself heavily disassembling Mahout into nuts and bolts rather
than using it exactly how e.g. MIA prescribes.

Bottom line here: preliminarily, the primary issues are ease of use,
embedding/scripting, ease of customization, and uniformity of APIs.

(1) Take the semantic explicitness and scripting issue. Well, I guess that's
where the R part comes from, not because we just want to run R. I would
clear it up right away -- I don't support any sort of R integration. And not
really for lack of trying -- I have created a few R front ends for a
bunch of distributed applications, and also created projects that run R in
the backend (I wrote CrunchR more than a year ago, which is the same thing
for Crunch as what SparkR is for Spark; and yet another MR framework running
R in the backend; and also tried to run things with HadoopR). And I have
developed a pretty strong opinion that R just doesn't mix with distributed
frameworks, mostly because of the performance penalties (if you lose
$5 per day in performance on a single machine it may be OK, but on 100
machines one loses $500 a day -- and mid-size companies in my experience
are not susceptible to the "let's solve it at any HW cost" doctrine, much as
it is generally believed to be the other way around).

Anyway, on the R topic, I don't see it as a solution for any sort of
semantically explicit driver and customizer technology. There's neither
demand nor willingness among corporate bosses to go that route. I have grown
pretty opinionated on that issue.

But you don't need R to address semantic explicitness, customization and
ease of integration/scripting. Pragmatically, I see Scala and a carefully
crafted Scala DSL as the underlying mechanism for achieving this. Also,
internally I use Scala scripting a lot, and it is really easy to build a shell
interpreter for it (just like Spark builds a customized shell), so one
doesn't even necessarily need to compile these things.

Bottom line: ideally, a distributed solver implementation should look more
like MATLAB than Java. And I would measure that goal along the lines of
Evan Sparks' talks (i.e. in lines of code and the explicitness needed to
script out a well-known method).

See, you forced my hand to discuss solutions ("how")  :)

(2) On the issue of minimally supported algorithms. Again, I would not see
MLlib as a prototype there. Given enough semantic explicitness, virtually
any data scientist would script out ALS in their sleep. And every second
one would script out weighted ALS (so-called "implicit feedback"). I view
those algorithms not as a goal but rather as a guinea pig for validating
the semantic value of the ML environment and APIs. I would port stronger
solvers into the new semantic ML environment over Spark rather than trying
to cover the very "basics".
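
To illustrate how short a scripted ALS can be, here is a minimal sketch in
plain Python. It is deliberately a toy and makes several assumptions: rank 1
only, a small dense and fully observed matrix, no weighting, nothing
distributed. The names als_rank1, lam and iters are hypothetical and do not
come from Mahout or MLlib.

```python
def als_rank1(R, lam=0.1, iters=20):
    """Rank-1 alternating least squares on a dense, fully observed matrix R.

    Approximates R[i][j] by u[i] * v[j]. Holding one factor fixed, the
    ridge-regularized least-squares solution for the other is closed-form.
    Assumption: R is a non-empty list of equal-length rows of floats.
    """
    m, n = len(R), len(R[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        # Fix v, solve for each u[i]: u[i] = (R[i] . v) / (v . v + lam)
        vv = sum(x * x for x in v) + lam
        u = [sum(R[i][j] * v[j] for j in range(n)) / vv for i in range(m)]
        # Fix u, solve for each v[j]: v[j] = (R[:, j] . u) / (u . u + lam)
        uu = sum(x * x for x in u) + lam
        v = [sum(R[i][j] * u[i] for i in range(m)) / uu for j in range(n)]
    return u, v


# Usage: an exactly rank-1 matrix is recovered when regularization is off.
R = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
u, v = als_rank1(R, lam=0.0, iters=5)
approx = [[ui * vj for vj in v] for ui in u]
```

A rank-k version would replace the scalar divisions with small k x k linear
solves, and the weighted ("implicit feedback") variant would add per-entry
confidence weights to the sums; neither is shown here.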

Pragmatically, I would say it would be interesting and practical (for me)
to have the LDA/LSA/sparse PCA solvers ported. I would also port all the
clustering we have (albeit maybe not exactly following the methodology).

I would also be interested in laying a foundation for customized hierarchical
solutions along the lines of RLFM with various customizations, including in
particular temporal weighting of inference and customized inference of
informative priors there. Computational Bayesian methods along the lines of
MCEM and MCMC are said to provide very accurate solutions here. The latter
class of models is, IMO, much more interesting for practitioners of
recommendations than the purely rigid, uncustomizable ALS class of models,
weighted or not. At least Deepak Agarwal sounds very convincing in his talks.

(3) On the issue of performance, I guess that by using the Spark bindings
DSL you can't do any worse than MLlib. Perhaps we could also include support
for dense JBlas matrices under the hood of the Matrix API, if there is
interest. Also, I am hearing that using GPU libraries is lately becoming very
popular for performance reasons; up to 300x linalg speedups are reported.
There are some fancy thoughts about cost-based optimization of algebraic
expressions for distributed pipelines, but for a start I will do just very
simple physical plan substitutions (something like: if I directly see A'A as
a part of an expression, or if an A'B' product has small geometry, then of
course I'd rather do (BA)', etc.).
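
A minimal sketch of what such substitutions could look like, in plain Python
on a toy expression tree. Everything here is an assumption made for
illustration: the node names (Mat, T, Mul, AtA) and the rewrite function are
hypothetical, the rules fire unconditionally rather than checking geometry,
and the matrices are small nested lists rather than distributed data.

```python
from dataclasses import dataclass


# Hypothetical expression nodes; illustrative only, not an actual API.
@dataclass(frozen=True)
class Mat:
    name: str


@dataclass(frozen=True)
class T:  # transpose of an expression
    arg: object


@dataclass(frozen=True)
class Mul:  # matrix product of two expressions
    left: object
    right: object


@dataclass(frozen=True)
class AtA:  # dedicated physical operator for A'A
    arg: object


def rewrite(e):
    """Apply the two substitutions bottom-up (unconditionally, for simplicity)."""
    if isinstance(e, T):
        return T(rewrite(e.arg))
    if isinstance(e, Mul):
        left, right = rewrite(e.left), rewrite(e.right)
        if isinstance(left, T) and left.arg == right:
            return AtA(right)                   # A'A  -> AtA(A)
        if isinstance(left, T) and isinstance(right, T):
            return T(Mul(right.arg, left.arg))  # A'B' -> (BA)'
        return Mul(left, right)
    return e


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def ata_by_rows(a):
    # A'A as a sum of outer products of the rows of A, so no transposed
    # copy of A is ever materialized.
    n = len(a[0])
    acc = [[0] * n for _ in range(n)]
    for row in a:
        for i in range(n):
            for j in range(n):
                acc[i][j] += row[i] * row[j]
    return acc


def evaluate(e, env):
    if isinstance(e, Mat):
        return env[e.name]
    if isinstance(e, T):
        return transpose(evaluate(e.arg, env))
    if isinstance(e, Mul):
        return matmul(evaluate(e.left, env), evaluate(e.right, env))
    if isinstance(e, AtA):
        return ata_by_rows(evaluate(e.arg, env))
    raise TypeError(f"unknown node: {e!r}")


# Usage: the rewritten plans evaluate to the same matrices as the originals.
A, B = Mat("A"), Mat("B")
env = {"A": [[1, 2], [3, 4], [5, 6]], "B": [[1, 0, 2], [0, 1, 1]]}
for expr in (Mul(T(A), A), Mul(T(A), T(B))):
    assert evaluate(rewrite(expr), env) == evaluate(expr, env)
```

The point of the sketch is only that the user-facing expression stays the
same while the plan underneath changes; a real optimizer would decide when
each substitution pays off.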

But it has the potential to do more while retaining an absolute degree of
manually forced execution (through forced checkpoints). It's just that I
would stop at what I pragmatically need to script out distributed SSVD at
this point.

(4) But in general I would say the scope of your issues sounds like
something that would close a gap between 0.5 and 1.0 rather than 0.9 and

On Thu, Feb 27, 2014 at 4:37 PM, Ted Dunning <> wrote:

> I would like to start a conversation about where we want Mahout to be for
> 1.0.  Let's suspend for the moment the question of how to achieve the
> goals.  Instead, let's converge on what we really would like to have happen
> and after that, let's talk about means that will get us there.
> Here are some goals that I think would be good in the area of numerics,
> classifiers and clustering:
> - runs with or without Hadoop
> - runs with or without map-reduce
> - includes (at least), regularized generalized linear models, k-means,
> random forest, distributed random forest, distributed neural networks
> - reasonably competitive speed against other implementations including
> graphlab, mlib and R.
> - interactive model building
> - models can be exported as code or data
> - simple programming model
> - programmable via Java or R
> - runs clustered or not
> What does everybody think?
