On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
> Re: Gokhan's PR post: here are my thoughts, but I did not want to post them
> there since they go beyond the scope of that PR's work to chase the
> root of the issue.
>
> on quasi-algebraic methods
> ========================
>
> What is the dilemma here? I don't see any.
>
> I already explained that no more than 25% of algorithms are truly 100%
> algebraic. But about 80% cannot avoid using some algebra, and close to 95%
> could benefit from using algebra (even stochastic and Monte Carlo stuff).
>
> So we are building a system that allows us to cut a developer's work by at
> least 60% and make that work more readable by 3000%. As far as I am
> concerned, that fulfills the goal. And I am perfectly happy writing a mix
> of engine-specific primitives and algebra.
>
> That's why I am a bit skeptical about attempts to abstract non-algebraic
> primitives such as row-wise aggregators in one of the pull requests.
> Engine-specific primitives and algebra can perfectly coexist in the guts.
> And that's how I do my stuff in practice, except I can now skip 80% of the
> effort on algebra and on bridging incompatible inputs/outputs.
I am **definitely** not advocating messing with the algebraic
optimizer. That was what I saw as the plus side to Gokhan's PR: a
separate engine abstraction for quasi/non-algebraic distributed methods.
I didn't comment on the PR either, because admittedly I have not had a
chance to spend a lot of time on it. But my quick takeaway was that we
could take some very useful and hopefully (close to) ubiquitous
distributed operators and pass them through to the engine "guts".
I briefly looked through some of the Flink and H2O code and noticed
Flink's AggregateOperator [1] and H2O's MapReduce API [2]. My thought
was that we could write pass-through operators for some of the more
useful operations from math-scala and then implement them fully in their
respective packages. Though I am not sure how this would work in either
case w.r.t. partitioning, e.g. on H2O's distributed DataFrame, or on
Flink for that matter. Again, I haven't had a lot of time to look at
these and see if this would work at all.
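To make the pass-through idea concrete, here is a minimal sketch of what such an operator contract might look like. Everything here is hypothetical: the trait name, the method signature, and the two "engines" (which are just plain Scala collections standing in for Flink's AggregateOperator or H2O's MRTask) are this sketch's own inventions, not existing math-scala API.

```scala
// Hypothetical pass-through contract: math-scala would declare the
// row-wise aggregation once, and each engine module would implement it
// with its native primitive. Engines are faked with plain collections.

// For this sketch a distributed "matrix" is just a Seq of rows.
type Row = Array[Double]

// The contract a backend module would implement.
trait RowAggregator {
  def aggregateRows(rows: Seq[Row])(zero: Row)(combine: (Row, Row) => Row): Row
}

// "Engine A": straightforward sequential fold.
object SequentialEngine extends RowAggregator {
  def aggregateRows(rows: Seq[Row])(zero: Row)(combine: (Row, Row) => Row): Row =
    rows.foldLeft(zero)(combine)
}

// "Engine B": partial aggregates per group, then a final merge --
// a toy stand-in for a partitioned/distributed aggregate.
object PartitionedEngine extends RowAggregator {
  def aggregateRows(rows: Seq[Row])(zero: Row)(combine: (Row, Row) => Row): Row =
    rows.grouped(2).map(_.foldLeft(zero)(combine)).foldLeft(zero)(combine)
}

// Front-end code is written once against the trait, engine-agnostically.
def columnSums(engine: RowAggregator, rows: Seq[Row]): Row =
  engine.aggregateRows(rows)(Array.fill(rows.head.length)(0.0)) { (a, b) =>
    a.zip(b).map { case (x, y) => x + y }
  }
```

The point of the sketch is only that the combine function must be associative with a zero element, so either engine (sequential fold or partitioned partial aggregates) produces the same result.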
My thought was not to bring primitive engine-specific aggregators,
combiners, etc. into math-scala.
I had thought, though, that we were trying to develop a fully
engine-agnostic algorithm library on top of the R-like distributed BLAS.
So would the idea be to implement, e.g., seq2sparse fully in the spark
module? That would seem to fracture the project a bit.
Or to implement algorithms sequentially when mapBlock() will not suffice,
and then optimize them in their respective modules?
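For readers who haven't used it, the mapBlock() style can be sketched with plain collections. This is an illustrative analog only, not Mahout's actual implementation: the block representation and function names below are this sketch's own, and the point is just that an algorithm written purely in terms of per-block functions needs nothing engine-specific.

```scala
// Illustrative analog of a mapBlock()-style operation: the matrix is
// split into horizontal blocks of rows, and a function is applied to
// each block. A real engine would run this once per partition; here
// the "distributed" matrix is just a Seq of row blocks.

type Block = Array[Array[Double]]

// Split rows into blocks of a given size, mimicking a partitioned DRM.
def blockify(rows: Array[Array[Double]], blockSize: Int): Seq[Block] =
  rows.grouped(blockSize).toSeq

// Apply f to every block; the engine decides where each block lives.
def mapBlock(blocks: Seq[Block])(f: Block => Block): Seq[Block] =
  blocks.map(f)

// Example "algorithm" expressible purely via mapBlock: scale all entries.
def scale(blocks: Seq[Block], s: Double): Seq[Block] =
  mapBlock(blocks)(block => block.map(row => row.map(_ * s)))
```

An algorithm stated this way stays engine-agnostic by construction, which is exactly why the question above matters for methods that *cannot* be stated this way.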
>
> None of that means that R-like algebra cannot be engine-agnostic. So people
> are unhappy about not being able to write the whole thing in a totally
> agnostic way? And so they (falsely) infer that the pieces of their work
> cannot be helped by agnosticism individually, or that the tools are not as
> good as they might be without backend agnosticism? Sorry, but I fail to see
> the logic there.
>
> We proved algebra can be agnostic. I don't think this notion should be
> disputed.
>
> And even if there were a shred of real benefit to making the algebra tools
> non-agnostic, it would never outweigh the tons of good we could get for the
> project by integrating with e.g. the Flink folks. This is one of the points
> MLLib will never be able to overcome: being a truly shared ML platform where
> people could create and share ML, not just a bunch of ad-hoc spaghetti
> of distributed API calls and Spark-nailed black boxes.
>
> Well yes, methodology implementations will still have native distributed
> calls. Just not nearly as many as they otherwise would, and they will be
> much easier to support on another backend using Strategy patterns. E.g. the
> implicit feedback problem that I originally wrote as a quasi-method for
> Spark only would've taken just an hour or so to add a strategy for Flink,
> since it retains all in-core and distributed algebra work as is.
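The Strategy idea above can be sketched in a few lines. This is a toy, hedged illustration: the trait, the backends, and the pairwise-count step are invented for the example (the real backends would be Spark and Flink primitives, not in-memory functions). The shape is what matters: only the genuinely engine-specific step hides behind the strategy, while the method's logic is written once.

```scala
// Hedged sketch of the Strategy pattern for quasi-algebraic methods:
// the shared core delegates only its one non-algebraic, engine-specific
// step to a per-backend strategy. Backends are faked with plain Scala.

trait EngineStrategy {
  // The one engine-specific step (in reality e.g. a custom shuffle/join).
  def pairwiseCounts(interactions: Seq[(String, String)]): Map[(String, String), Int]
}

// Stand-in for a Spark-backed strategy: group-by style counting.
object FakeSparkStrategy extends EngineStrategy {
  def pairwiseCounts(xs: Seq[(String, String)]): Map[(String, String), Int] =
    xs.groupBy(identity).map { case (k, v) => k -> v.size }
}

// Stand-in for a Flink-backed strategy: incremental fold-style counting.
object FakeFlinkStrategy extends EngineStrategy {
  def pairwiseCounts(xs: Seq[(String, String)]): Map[(String, String), Int] =
    xs.foldLeft(Map.empty[(String, String), Int]) { (m, k) =>
      m.updated(k, m.getOrElse(k, 0) + 1)
    }
}

// The method itself is strategy-agnostic: swap the backend, keep the logic.
def topPair(strategy: EngineStrategy, xs: Seq[(String, String)]): (String, String) =
  strategy.pairwiseCounts(xs).maxBy(_._2)._1
```

Adding a new backend then means implementing one trait, not rewriting the method, which is the "hour or so" claim above in miniature.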
>
> Not to mention the benefit of single-type pipelining.
>
> And once we add hardware-accelerated bindings for in-core stuff, all these
> methods would immediately benefit from it.
>
> On MLLib interoperability issues
> =========================
>
> well, let me ask you this: what does it mean to be MLLib-interoperable? Is
> MLLib even interoperable within itself?
>
> E.g. I remember there was one most frequent request on the list here: how
> can we cluster dimensionally-reduced data?
>
> Let's look at what it takes to do this in MLLib: First, we run tf-idf, which
> produces a collection of vectors (and where did our document ids go? not
> sure); then we'd have to run SVD or PCA, both of which accept a
> RowMatrix (bummer! but we have a collection of vectors); that would produce
> a RowMatrix as well, but k-means training takes an RDD of vectors (bummer
> again!).
>
> Not directly pluggable, although semi-trivially or trivially convertible.
> Plus it strips off information that we have potentially already computed
> earlier in the pipeline, so we'd need to compute it again. I think the
> problem is well demonstrated.
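The type churn described above can be shown schematically. To keep this self-contained, the MLLib classes are replaced with trivial stand-ins (the `Vec`, `RowMatrix`, and step functions below are this sketch's own, not real MLLib API); the shape of the glue, and the loss of document ids at the first step, is the point.

```scala
// Schematic of the pipeline mismatch: tf-idf yields bare vectors, SVD/PCA
// wants a RowMatrix, k-means wants bare vectors again -- so the caller
// writes glue at every joint, and the document ids are gone after step 1.

case class Vec(values: Array[Double])
case class RowMatrix(rows: Seq[Vec]) // toy stand-in for mllib's RowMatrix

// Step 1: "tf-idf" produces vectors only; the doc ids do not survive.
def tfidf(docs: Map[Long, Vec]): Seq[Vec] = docs.values.toSeq

// Step 2: "SVD/PCA" takes and returns a RowMatrix (identity stand-in here),
// so the caller must wrap the vector collection first...
def pca(m: RowMatrix): RowMatrix = m

// Step 3: "k-means" wants plain vectors again, so the caller must unwrap.
def kmeansInput(m: RowMatrix): Seq[Vec] = m.rows

// The glue the caller is forced to write: wrap, run, unwrap.
def pipeline(docs: Map[Long, Vec]): Seq[Vec] =
  kmeansInput(pca(RowMatrix(tfidf(docs))))
```

Each conversion is trivial on its own, which is exactly the "trivial but hard" verdict drawn later in this thread: the individual lines are one-liners, but every pair of adjacent stages needs one.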
>
> Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic
> problem. It should take input in the form of matrices (which my algebraic
> feature-extraction pipeline perhaps has just prepared) but it really takes
> POJOs. Bummer again.
>
> So what exactly should we be interoperable with in this picture, if
> MLLib itself is not consistent?
>
> Let's look at the type system in flux there:
>
> we have
> (1) a collection of vectors,
> (2) a matrix of known dimensions for a collection of vectors (RowMatrix),
> (3) IndexedRowMatrix, which is a matrix of known dimensions with keys that
> can be _only_ longs; and
> (4) an unknown but not infinitesimal number of POJO-oriented approaches.
>
> But ok, let's constrain ourselves to matrix types only.
>
> A multitude of matrix types creates problems for tasks that require
> consistent key propagation (like SVD or PCA or tf-idf, well demonstrated
> in the case of mllib). In the aforementioned case of dimensionality
> reduction over a document collection, there's simply no way to propagate
> document ids to the rows of the dimensionally-reduced data. As in none at
> all; a hard, no-workaround-exists stop.
>
> So. There's truly no need for multiple incompatible matrix types. There has
> to be just a single matrix type. Just a flexible one. And everything
> algebraic needs to use it.
>
> And if geometry is needed, then it could be either already known or lazily
> computed, but if it is not needed, nobody bothers to compute it. And this
> knowledge should not be lost just because we have to convert between types.
>
> And if we want to express complex row keys, such as cluster assignments
> for example (my real case), then we could have a type with keys like
> Tuple2(rowKeyType, clusterString).
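The single-flexible-matrix-type argument can be sketched as a row matrix generic in its key type, loosely in the spirit of a DrmLike[K]-style abstraction. All names and the representation below are this sketch's own, chosen only to show key propagation: document ids (or (docId, cluster) pairs) ride along through any row-preserving transform for free.

```scala
// Sketch of one flexible row-keyed matrix type, generic in the key K.
// A row-preserving transform keeps keys automatically; a re-keying step
// can enrich keys, e.g. attaching a cluster label to each row.

case class KeyedMatrix[K](rows: Map[K, Array[Double]]) {
  // Row-preserving transform: keys propagate untouched.
  def mapRows(f: Array[Double] => Array[Double]): KeyedMatrix[K] =
    KeyedMatrix(rows.map { case (k, v) => k -> f(v) })

  // Re-keying, e.g. to Tuple2(rowKey, clusterLabel) for cluster assignments.
  def mapKeys[K2](f: (K, Array[Double]) => K2): KeyedMatrix[K2] =
    KeyedMatrix(rows.map { case (k, v) => f(k, v) -> v })
}

// "Dimensionality reduction" stand-in: keep the first d columns.
// (A real reduction would be a projection; column truncation is enough
// to show that the document ids survive the step.)
def reduceDims[K](m: KeyedMatrix[K], d: Int): KeyedMatrix[K] =
  m.mapRows(row => row.take(d))
```

Contrast this with the RowMatrix/IndexedRowMatrix split above: with one generic key type there is nothing to lose between pipeline stages, and Long-only keys become just one instantiation of K.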
>
> And nobody really cares whether intermediate results are row- or
> column-partitioned.
>
> All within a single type of thing.
>
> Bottom line, "interoperability" with mllib is both hard and trivial.
>
> It is trivial because whenever you need to convert, it is one line of code
> and also a trivial distributed map-fusion element. (I do have pipelines
> streaming mllib methods within DRM-based pipelines; I am not just
> speculating.)
>
> It is hard because there are so many types you may need/want to convert
> between that there's not much point even trying to write converters for all
> possible cases; rather, go on a need-to-do basis.
>
> It is also hard because their type system obviously continues evolving as
> we speak. So there's no point chasing a rabbit still in the making.
>
> Epilogue
> =======
> There's no problem with the philosophy of the distributed and
> non-distributed algebra approach. It is incredibly useful in practice, and
> I have proven it continuously (what is in the public domain is just the tip
> of the iceberg).
>
> Rather, there's organizational anemia in the project. Like corporate legal
> interests (that includes me not being able to do a quick turnaround of
> fixes), and not having been able to tap into university resources. But I
> don't believe in any technical philosophy problem.
>
> So given the aforementioned resource/logistical anemia, it will likely take
> some time, and it may seem to get worse before it gets better. But AFAIK
> there are multiple efforts going on behind the curtains to break the red
> tape, so I'd just wait a bit.
>
[1]
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
[2]
http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/developuser/java.html
