mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: [jira] [Commented] (MAHOUT-1500) H2O integration
Date Tue, 01 Apr 2014 18:32:26 GMT
On Tue, Apr 1, 2014 at 3:09 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I would rather see a matrix that looks local but acts global so that
> coders can produce very simple code that is still parallelized.
>

And that's exactly how it is done in Bindings.

This discussion is not about that, though. This discussion is about why
doing that on the Matrix and Vector hierarchy is a bad idea.

Trying to explain why.

The Matrix and Vector APIs have historically mixed in a lot of concerns,
not just linalg operators. E.g. they also include element data access
views and patterns (getQuick, getRow, iterateNonZero) and in-core-specific
optimizer hints like

  double getLookupCost();

  double getIteratorAdvanceCost();

etc. Normally that is addressed via mix-ins, but it wasn't here (and
mix-ins are hard to do in Java in general).

A corollary to that is the simple fact that 95% of Mahout code (and, more
importantly, outside code) is something like

  for (el : v.iterateNonZero()) {
    ... do something with element
  }

*which is not parallelizable at all and would require major
refactoring of the APIs and all user code to make it so.*
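
To make the contrast concrete, here is a minimal sketch in plain Java 8.
It does not use any Mahout classes; ElementLoopVsOperator,
dotByElementLoop and dotByOperator are hypothetical names made up for
illustration, and the parallel stream merely stands in for a distributed
backend. In the first method the caller owns the loop and the
accumulator, so the library has no room to reorder or distribute the
work. In the second the caller only states what to compute.

```java
import java.util.stream.IntStream;

// Hypothetical illustration only; these are not Mahout classes.
final class ElementLoopVsOperator {

  // Element-at-a-time style: the iteration order and the mutable
  // accumulator live in user code, so the computation is fixed as
  // a sequential walk over the elements.
  static double dotByElementLoop(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0.0) {          // mimic visiting non-zero elements only
        sum += a[i] * b[i];
      }
    }
    return sum;
  }

  // Operator style: the whole computation is one bulk expression, so the
  // implementation is free to split it up (here via a parallel stream).
  static double dotByOperator(double[] a, double[] b) {
    return IntStream.range(0, a.length)
        .parallel()
        .mapToDouble(i -> a[i] * b[i])
        .sum();
  }
}
```

Code written in the first style has to be rewritten by hand to become
parallel; code written in the second style does not change when the
backend does.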

*Two arguments follow from this lack of parallelizability:*
*(1) doing what you say on the AbstractMatrix or AbstractVector hierarchy
is not possible without a "nuclear option" on the API, which will send
a ripple effect inside and outside Mahout (my own outside code in
particular);*

(2) and even if we invoked the "nuclear option", doing so has no
benefit over introducing a parallel type hierarchy for distributed
matrices, since write-once-run-everywhere works there too.
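
As a rough sketch of what a parallel type hierarchy could look like (all
names here are hypothetical; InCoreMatrix and DistMatrix are not Mahout
types, and a list of row blocks processed by a parallel stream only
stands in for real partitions on workers): the in-core type keeps
element access, while the distributed type exposes operators only, so
nothing in the existing API has to change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-core type: element access and operators side by side,
// in the spirit of the existing Matrix API.
final class InCoreMatrix {
  final double[][] data;

  InCoreMatrix(double[][] data) { this.data = data; }

  double get(int r, int c) { return data[r][c]; }

  InCoreMatrix times(InCoreMatrix other) {
    int n = data.length, k = other.data.length, m = other.data[0].length;
    double[][] out = new double[n][m];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        for (int p = 0; p < k; p++)
          out[i][j] += data[i][p] * other.data[p][j];
    return new InCoreMatrix(out);
  }
}

// Hypothetical distributed type in a separate hierarchy. It deliberately
// has no get(r, c) and no element iterator, only operators.
final class DistMatrix {
  // Row blocks stand in for partitions held by different workers.
  private final List<InCoreMatrix> rowBlocks;

  DistMatrix(List<InCoreMatrix> rowBlocks) { this.rowBlocks = rowBlocks; }

  // Right-multiply every row block by a small in-core matrix.
  DistMatrix times(InCoreMatrix small) {
    List<InCoreMatrix> out = rowBlocks.parallelStream()
        .map(block -> block.times(small))
        .collect(Collectors.toList());
    return new DistMatrix(out);
  }

  // Explicitly bring the result in-core by stacking the row blocks.
  InCoreMatrix collect() {
    List<double[]> rows = new ArrayList<>();
    for (InCoreMatrix block : rowBlocks) {
      rows.addAll(Arrays.asList(block.data));
    }
    return new InCoreMatrix(rows.toArray(new double[0][]));
  }
}
```

User code reads the same against either type (a.times(b)), which is the
write-once property, while element-level access stays confined to the
in-core side.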

The idea of write-once-run either in-core or out-of-core is very
noble, but in practice it is neither quite feasible (mostly because of
component lifecycle and optimization checkpointing concerns) nor of
significant value. I.e. if one can have ssvd and dssvd in 29 lines
(assuming the same algorithm even has a parallelization strategy),
then there's no harm in having two separate things for in-core and
out-of-core: dssvd() and ssvd().
