mahout-dev mailing list archives

From Andrew Palumbo <ap....@outlook.com>
Subject Re: Codebase refactoring proposal
Date Wed, 04 Feb 2015 21:51:48 GMT

On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
> Re: Gokhan's PR post: here are my thoughts, but I did not want to post them
> there since they go beyond the scope of that PR's work to chase the root of
> the issue.
>
> on quasi-algebraic methods
> ========================
>
> What is the dilemma here? I don't see any.
>
> I already explained that no more than 25% of algorithms are truly 100%
> algebraic. But about 80% cannot avoid using some algebra, and close to 95%
> could benefit from using algebra (even stochastic and Monte Carlo stuff).
>
> So we are building a system that allows us to cut a developer's work by at
> least 60% and make it 3000% more readable. As far as I am concerned, that
> fulfills the goal. And I am perfectly happy writing a mix of
> engine-specific primitives and algebra.
>
> That's why I am a bit skeptical about attempts to abstract non-algebraic
> primitives such as row-wise aggregators in one of the pull requests.
> Engine-specific primitives and algebra can perfectly co-exist in the guts.
> And that's how I do my stuff in practice, except I can now skip 80% of the
> effort on algebra and on bridging incompatible inputs and outputs.
I am **definitely** not advocating messing with the algebraic optimizer.
That was what I saw as the plus side of Gokhan's PR: a separate engine
abstraction for quasi/non-algebraic distributed methods. I didn't comment
on the PR either, because admittedly I did not have a chance to spend much
time on it. But my quick takeaway was that we could take some very useful
and hopefully (close to) ubiquitous distributed operators and pass them
through to the engine "guts".

I briefly looked through some of the Flink and H2O code and noticed Flink's
AggregateOperator [1] and H2O's MapReduce API [2]. My thought was that we
could write pass-through operators for some of the more useful operations in
math-scala and then implement them fully in their respective packages; a
rough sketch of what I mean follows below. Though I am not sure how this
would work in either case w.r.t. partitioning, e.g. on H2O's distributed
DataFrame, or on Flink for that matter. Again, I haven't had a lot of time
to look at these and see if this would work at all.
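
Purely as a sketch, something like the following -- every name here is
hypothetical; nothing like it exists in math-scala today:

    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.DrmLike

    // math-scala would only declare the operator; each engine module then
    // implements it against its own primitives (Spark's RDD.aggregate,
    // Flink's AggregateOperator, H2O's MRTask) behind this one signature.
    trait DistributedRowAggregator {
      // fold the rows of a DRM into a single in-core vector, e.g. column sums
      def aggregateRows[K](drm: DrmLike[K])(seqOp: (Vector, Vector) => Vector): Vector
    }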

My thought was not to bring primitive engine-specific aggregators,
combiners, etc. into math-scala.

I had thought, though, that we were trying to develop a fully
engine-agnostic algorithm library on top of the R-like distributed BLAS.
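
That is, code where the algorithm itself never mentions an engine -- a
minimal sketch using only the existing DRM DSL (assumes an implicit
DistributedContext in scope, e.g. from the spark bindings):

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    val inCoreA = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
    val drmA = drmParallelize(inCoreA, numPartitions = 2)

    val drmAtA = drmA.t %*% drmA   // distributed A'A, planned by the optimizer
    val ata = drmAtA.collect       // small Gramian brought back in-core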


So would the idea be to implement, e.g., seq2sparse fully in the spark
module?  That would seem to fracture the project a bit.


Or to implement algorithms sequentially where mapBlock() will not suffice,
and then optimize them in their respective modules?
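
For reference, the kind of per-block escape hatch mapBlock() already gives
us, engine-agnostically (minimal sketch; drmA as in the snippet above):

    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // the closure sees one in-core (keys, block) pair per vertical block
    val drmB = drmA.mapBlock(ncol = drmA.ncol) { case (keys, block) =>
      block += 1.0      // in-core R-like elementwise op on the block
      keys -> block
    }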


>
> None of that means that R-like algebra cannot be engine-agnostic. So people
> are unhappy about not being able to write the whole thing in a totally
> agnostic way? And so they (falsely) infer that the pieces of their work
> cannot be helped by agnosticism individually, or that the tools are not as
> good as they might be without backend agnosticism? Sorry, but I fail to see
> the logic there.
>
> We proved algebra can be agnostic. I don't think this notion should be
> disputed.
>
> And even if there were a shred of real benefit in making the algebra tools
> un-agnostic, it would not ever outweigh the tons of good we could get for
> the project by integrating with e.g. the Flink folks. This is one of the
> points MLlib will never be able to overcome -- being a truly shared ML
> platform where people can create and share ML, not just a bunch of ad-hoc
> spaghetti of distributed API calls and Spark-nailed black boxes.
>
> Well, yes, methodology implementations will still have native distributed
> calls. Just not nearly as many as they otherwise would, and they will be
> much easier to support on another backend using Strategy patterns. E.g. the
> implicit feedback problem that I originally wrote as a quasi-method for
> Spark only would've taken just an hour or so to gain a strategy for Flink,
> since it retains all the in-core and distributed algebra work as is.
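>
> Roughly like this (a sketch; the names here are hypothetical):
>
>     import org.apache.mahout.math.Matrix
>     import org.apache.mahout.math.drm._
>     import org.apache.mahout.math.drm.RLikeDrmOps._
>
>     // the one engine-specific step hides behind a strategy...
>     trait SamplingStrategy {
>       def sampleRows(drm: DrmLike[Int], fraction: Double): DrmLike[Int]
>     }
>
>     // ...while the method body stays pure, engine-agnostic algebra
>     def quasiStep(drmC: DrmLike[Int], s: SamplingStrategy): Matrix = {
>       val drmS = s.sampleRows(drmC, 0.1)
>       (drmS.t %*% drmS).collect
>     }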
>
> Not to mention the benefit of single-type pipelining.
>
> And once we add hardware-accelerated bindings for the in-core stuff, all
> these methods would immediately benefit from them.
>
> On MLlib interoperability issues
> =========================
>
> Well, let me ask you this: what does it mean to be MLlib-interoperable? Is
> MLlib even interoperable within itself?
>
> E.g. I remember one of the most frequent requests on the list here: how
> can we cluster dimensionally reduced data?
>
> Let's look at what it takes to do this in MLlib. First, we run tf-idf,
> which produces a collection of vectors (and where did our document ids go?
> not sure); then we'd have to run SVD or PCA, both of which accept a
> RowMatrix (bummer! but we have a collection of vectors); that would produce
> a RowMatrix as well, but k-means training takes an RDD of vectors (bummer
> again!).
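>
> Spelled out (current Spark MLlib APIs, so details may have shifted by the
> time you read this; assumes a SparkContext sc in scope):
>
>     import org.apache.spark.mllib.clustering.KMeans
>     import org.apache.spark.mllib.feature.{HashingTF, IDF}
>     import org.apache.spark.mllib.linalg.Vector
>     import org.apache.spark.mllib.linalg.distributed.RowMatrix
>     import org.apache.spark.rdd.RDD
>
>     val docs: RDD[Seq[String]] = sc.parallelize(Seq(Seq("a", "b"), Seq("b", "c")))
>     val tf: RDD[Vector] = new HashingTF(100).transform(docs) // document ids? gone.
>     val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
>     val mat = new RowMatrix(tfidf)                           // convert for PCA
>     val reduced = mat.multiply(mat.computePrincipalComponents(10))
>     val model = KMeans.train(reduced.rows, 2, 20)            // and convert again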
>
> Not directly pluggable, although semi-trivially or trivially convertible.
> Plus, it strips off information that we potentially have already computed
> earlier in the pipeline, so we'd need to compute it again. I think the
> problem is well demonstrated.
>
> Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic
> problem. It should take input in the form of matrices (which my algebraic
> feature-extraction pipeline perhaps has just prepared), but it actually
> takes POJOs. Bummer again.
>
> So what exactly should we be interoperable with in this picture, if MLlib
> itself is not consistent?
>
> Let's look at the type system in flux there:
>
> we have
> (1) a collection of vectors,
> (2) a matrix of known dimensions over a collection of vectors (RowMatrix),
> (3) IndexedRowMatrix, which is a matrix of known dimensions with keys that
> can be _only_ longs; and
> (4) an unknown but not infinitesimal number of POJO-oriented approaches.
>
> But ok, let's constrain ourselves to matrix types only.
>
> The multitude of matrix types creates problems for tasks that require
> consistent key propagation (like SVD, PCA, or tf-idf, as demonstrated in
> the MLlib case above). In the aforementioned case of dimensionality
> reduction over a document collection, there's simply no way to propagate
> document ids to the rows of the dimensionally reduced data. As in none at
> all; as in a hard, no-workaround-exists stop.
>
> So. There's truly no need for multiple incompatible matrix types. There
> has to be just a single matrix type. Just a flexible one. And everything
> algebraic needs to use it.
>
> And if geometry is needed, it could be either already known or lazily
> computed; but if it is not needed, nobody bothers to compute it. And this
> knowledge should not be lost just because we have to convert between types.
>
> And if we want to express complex row keys, such as cluster assignments
> for example (my real case), then we could have a type with keys like
> Tuple2(rowKeyType, cluster-string).
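>
> E.g. (sketch; assignCluster is a stand-in labeling function, and drmA is
> an Int-keyed DRM):
>
>     // re-key the rows as (original key, cluster label) -- it is still the
>     // same single DRM type, just with a different key type
>     def assignCluster(k: Int): String = if (k % 2 == 0) "c0" else "c1"
>
>     val drmLabeled = drmA.mapBlock() { case (keys, block) =>
>       keys.map(k => (k, assignCluster(k))) -> block
>     }
>     // drmLabeled: DrmLike[(Int, String)]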
>
> And nobody really cares whether intermediate results are row- or
> column-partitioned.
>
> All within a single type of thing.
>
> Bottom line: "interoperability" with MLlib is both trivial and hard.
>
> It is trivial because whenever you need to convert, it is one line of code
> and also a trivial distributed map fusion element. (I do have pipelines
> streaming MLlib methods within DRM-based pipelines; I'm not just
> speculating.)
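>
> The one-liner, roughly (drmWrap is the spark-bindings entry point;
> toMahoutVector is a stand-in converter, not an existing helper; assumes an
> rdd: RDD[(Int, org.apache.spark.mllib.linalg.Vector)] in scope):
>
>     import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
>     import org.apache.mahout.sparkbindings._
>
>     def toMahoutVector(v: org.apache.spark.mllib.linalg.Vector): MahoutVector =
>       new DenseVector(v.toArray)
>
>     val drm = drmWrap(rdd.map { case (k, v) => k -> toMahoutVector(v) })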
>
> It is hard because there are so many types you may need or want to convert
> between that there's not much point in even trying to write converters for
> all possible cases; better to go on a need-to-do basis.
>
> It is also hard because their type system obviously keeps evolving as we
> speak. So there's no point chasing a rabbit still in the making.
>
> Epilogue
> =======
> There's no problem with the philosophy of the distributed and
> non-distributed algebra approach. It is incredibly useful in practice, and
> I have proven it continuously (what is in the public domain is just the
> tip of the iceberg).
>
> Rather, there's organizational anemia in the project: things like corporate
> legal interests (which include me not being able to do quick turnarounds on
> fixes), and not having been able to tap into university resources. But I
> don't believe there is any technical-philosophy problem.
>
> So given the aforementioned resource/logistical anemia, it will likely
> take some time, and it may well seem to get worse before it gets better.
> But AFAIK there are multiple efforts going on behind the curtain to cut
> through the red tape, so I'd just wait a bit.
>


[1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
[2] http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/developuser/java.html


