mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Samsara's learning curve
Date Mon, 27 Mar 2017 17:32:39 GMT
I believe writing in the DSL is simple enough, especially if you have some
familiarity with Scala on top of R (or, in my case, R on top of Scala
perhaps :). I've implemented a couple dozen customized algorithms that
used distributed Samsara algebra at least to some degree, and I think I can
reliably attest that none of them ever exceeded 100 lines or so, and that it
significantly reduced the time I spend writing algebra on top of Spark
and some other backends I use under proprietary settings. I am now mostly
doing non-algebraic improvements because writing algebra is easy.

The most difficult part, however, at least for me, and as you can see as you
go along with the book, was not the peculiarities of the R-like bindings but
the algorithm reformulations. Traditional "in-memory" algorithms do not
work on shared-nothing backends: even though you could program them, they
simply will not perform.

The main reasons some of the traditional algorithms do not work at scale
are that they either require random memory access or (more often) are
simply super-linear w.r.t. input size, so even as one scales infrastructure at
linear cost, one still gets a smaller than expected gain in
performance (if any at all, at some point) per unit of input.
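
To put rough numbers on the super-linear point, here is a minimal sketch. It
assumes an idealized cost model that is not from the book or from Samsara:
total work is n**exponent, it parallelizes perfectly, and the cluster grows
linearly with the input. The function name and parameters are hypothetical.

```python
def wall_time(n, exponent, workers_per_unit=1.0):
    """Idealized wall-clock time for an input of size n.

    Assumptions (illustrative only): total work is n**exponent, the work is
    perfectly parallelizable, and the number of workers grows linearly with
    the input, i.e. workers = workers_per_unit * n.
    """
    workers = workers_per_unit * n
    return n ** exponent / workers


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        linear = wall_time(n, exponent=1)
        quadratic = wall_time(n, exponent=2)
        print(f"n={n}: linear={linear}, quadratic={quadratic}")
```

Under these assumptions the linear algorithm keeps a constant running time as
input and hardware grow together, while the quadratic one still slows down in
proportion to n despite the linearly growing hardware bill.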

Hence, some mathematically, or should I say statistically,
motivated tricks are usually still required. As the book describes, linearly or
sub-linearly scalable sketches, random projections, dimensionality
reductions, etc. are needed to alleviate the scalability issues of
super-linear algorithms.
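
As one illustration of such a trick, below is a minimal pure-Python sketch of
a Gaussian random projection. It is not Samsara code and not the book's
implementation; the function names, the 1/sqrt(k) scaling, and the seed
handling are assumptions made for the example. The point is only that each
row is processed independently in a single pass, so the cost is linear in the
number of rows and the work distributes trivially across a shared-nothing
backend.

```python
import random


def make_projection(d, k, seed=0):
    """Return a d x k matrix of N(0, 1) entries scaled by 1/sqrt(k)."""
    rng = random.Random(seed)
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project_row(row, omega):
    """Map one d-dimensional row to k dimensions (row times omega)."""
    k = len(omega[0])
    out = [0.0] * k
    for x, omega_row in zip(row, omega):
        if x != 0.0:  # skip zeros so sparse rows stay cheap
            for j in range(k):
                out[j] += x * omega_row[j]
    return out


def project(rows, omega):
    """Project every row independently: one pass, no cross-row state."""
    return [project_row(row, omega) for row in rows]


if __name__ == "__main__":
    omega = make_projection(d=6, k=2, seed=42)
    data = [[1.0, 0.0, 2.0, 0.0, 0.0, 3.0],
            [0.0, 1.0, 0.0, 4.0, 0.0, 0.0]]
    for reduced in project(data, omega):
        print(reduced)
```

Because the projection is linear and needs no coordination between rows, the
same omega can be shipped to every worker and applied to its partition of the
data.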

To your question: I have had a couple of people do some pieces on various
projects with Samsara before, but they had me as a coworker. I am
personally not aware of any outside developers beyond people already on the
project @ Apache and my co-workers, although in all honesty I feel that has
more to do with the maturity and modest marketing of the public version of
Samsara than necessarily with the difficulty of adoption.

-d



On Sun, Mar 26, 2017 at 9:15 AM, Gustavo Frederico <
gustavo.frederico@thinkwrap.com> wrote:

> I read Lyubimov's and Palumbo's book on Mahout Samsara up to chapter 4
> ( Distributed Algebra ). I have some familiarity with R, I did study
> linear algebra and calculus in undergrad. In my master's I studied
> statistical pattern recognition and researched a number of ML
> algorithms in my thesis - spending more time on SVMs. This is to ask:
> what is the learning curve of Samsara? How complicated is it to work with
> distributed algebra to create an algorithm? Can someone share an
> example of how long she/he took to go from algorithm conception to
> implementation?
>
> Thanks
>
> Gustavo
>
