mahout-dev mailing list archives

From Dmitriy Lyubimov <>
Subject Re: [jira] [Commented] (MAHOUT-1500) H2O integration
Date Tue, 01 Apr 2014 18:32:26 GMT
On Tue, Apr 1, 2014 at 3:09 AM, Ted Dunning <> wrote:

> I would rather see a matrix that looks local but acts global so that
> coders can produce very simple code that is still parallelized.

And that's exactly how it is done in Bindings.

This discussion is not about that, though. It is about why doing that
on the Matrix and Vector hierarchy is a bad idea.

Let me try to explain why.

The Matrix and Vector APIs have historically mixed in a lot of concerns
(not just linalg operators). E.g. they also include element data access
views and patterns (getQuick, getRow, iterateNonZero), as well as
in-core-specific optimizer hints like

  double getLookupCost();

  double getIteratorAdvanceCost();

etc. Normally such concerns would be separated out via mix-ins, but
that wasn't done here (and it is hard to do in Java in general).

A corollary of that is the simple fact that 95% of Mahout code (and,
more importantly, outside code) looks something like

for (el : v.iterateNonZero()) {
   ... do something with element
}

which is not parallelizable at all and would require major
refactoring of the APIs and all user code to make it so.
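
To illustrate the point, here is a minimal Java sketch. SparseVec and
IterationDemo are hypothetical stand-ins invented for this example, not
Mahout classes; only the iterateNonZero name echoes the pattern above.
Both methods compute the same dot product, but only the operator-level
call leaves the execution strategy to the implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy sparse vector; NOT Mahout's Vector class.
final class SparseVec {
    private final Map<Integer, Double> nonZeros = new LinkedHashMap<>();

    void set(int index, double value) {
        if (value != 0.0) {
            nonZeros.put(index, value);
        } else {
            nonZeros.remove(index);
        }
    }

    double get(int index) {
        return nonZeros.getOrDefault(index, 0.0);
    }

    // Element-level view, named after the iterateNonZero pattern in the text.
    Iterable<Map.Entry<Integer, Double>> iterateNonZero() {
        return nonZeros.entrySet();
    }

    // Operator-level call: callers never touch individual elements, so a
    // different implementation could partition the work behind this method.
    double dot(SparseVec other) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : nonZeros.entrySet()) {
            sum += e.getValue() * other.get(e.getKey());
        }
        return sum;
    }
}

final class IterationDemo {
    // Caller-side loop: the traversal order and the running sum live in user
    // code, so no backend can redistribute this work without a rewrite here.
    static double dotByIteration(SparseVec a, SparseVec b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> el : a.iterateNonZero()) {
            sum += el.getValue() * b.get(el.getKey());
        }
        return sum;
    }
}
```

Code shaped like dotByIteration is what would have to be rewritten
everywhere; code shaped like a.dot(b) would not.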

Two arguments follow from that:
(1) doing what you say on the AbstractMatrix or AbstractVector
hierarchy is not possible without a "nuclear option" on the API, which
would send a ripple effect through code inside and outside Mahout (my
own outside code in particular);

(2) even if we did invoke the "nuclear option", doing so has no
benefit over introducing a parallel type hierarchy for distributed
matrices, since write-once-run-everywhere works there too.
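
As a rough sketch of what such a parallel hierarchy could look like in
Java (all names here, DistMatrix, ToyDistMatrix and DistAlgorithms, are
hypothetical and are not the Bindings API): the distributed type exposes
only algebraic operators and no element access, and an algorithm written
against it runs on whatever backend implements the interface. The toy
backend below keeps everything in one array purely so the example runs,
and assumes non-empty matrices with matching dimensions.

```java
import java.util.Arrays;

// Hypothetical operator-only interface for a distributed matrix.
// It deliberately has no getQuick/iterateNonZero-style element access.
interface DistMatrix {
    DistMatrix times(DistMatrix other);
    DistMatrix transpose();
    double[][] collect(); // materialize in-core; assumed small enough to fit
}

// Toy single-JVM backend, present only so the sketch runs. A real backend
// would hold partitions on a cluster instead of one array.
final class ToyDistMatrix implements DistMatrix {
    private final double[][] rows;

    ToyDistMatrix(double[][] rows) {
        this.rows = copy(rows);
    }

    private static double[][] copy(double[][] src) {
        double[][] out = new double[src.length][];
        for (int i = 0; i < src.length; i++) {
            out[i] = Arrays.copyOf(src[i], src[i].length);
        }
        return out;
    }

    @Override
    public DistMatrix times(DistMatrix other) {
        double[][] b = other.collect();
        int n = rows.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                for (int p = 0; p < k; p++) {
                    c[i][j] += rows[i][p] * b[p][j];
                }
            }
        }
        return new ToyDistMatrix(c);
    }

    @Override
    public DistMatrix transpose() {
        int n = rows.length, m = rows[0].length;
        double[][] t = new double[m][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                t[j][i] = rows[i][j];
            }
        }
        return new ToyDistMatrix(t);
    }

    @Override
    public double[][] collect() {
        return copy(rows);
    }
}

final class DistAlgorithms {
    // Written once against operators only; runs on any DistMatrix backend.
    static DistMatrix gram(DistMatrix a) {
        return a.transpose().times(a);
    }
}
```

The existing Matrix and Vector types stay untouched in this arrangement;
only new code opts into the operator-only type.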

The idea of write-once, run either in-core or out-of-core is very
noble, but in practice it is neither quite feasible (mostly because of
component lifecycle and optimization checkpointing concerns) nor of
significant value. I.e., if one can have ssvd and dssvd in 29 lines
(assuming the same algorithm even has a parallelization strategy),
then there's no harm in having two separate things for in-core and
out-of-core: ssvd() and dssvd().
