Date: Tue, 1 Apr 2014 11:32:26 -0700
Subject: Re: [jira] [Commented] (MAHOUT-1500) H2O integration
From: Dmitriy Lyubimov
To: "dev@mahout.apache.org"

On Tue, Apr 1, 2014 at 3:09 AM, Ted Dunning wrote:

> I would rather see a matrix that looks local but acts global so that
> coders can produce very simple code that is still parallelized.

And that's exactly how it is done in the Bindings. This discussion is not about that, though; it is about why doing that on the Matrix and Vector hierarchy is a bad idea. Let me try to explain why.

The Matrix and Vector APIs historically mix in a lot of concerns beyond the linear-algebra operators. They also include element data access views and patterns (getQuick, getRow, iterateNonZero) and in-core-specific optimizer hints such as double getLookupCost() and double getIteratorAdvanceCost(), etc. Normally that would be addressed via mix-ins, but it wasn't (and mix-ins are hard in Java in general).
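To make the mixing of concerns concrete, here is a deliberately condensed, illustrative sketch, not the actual Mahout Vector API (whose surface is much larger). The method names are the ones cited above, but the grouping and signatures are my own approximation:

```java
import java.util.Iterator;

// Illustrative sketch only: roughly the kinds of concerns that end up on one
// in-core type. Signatures are approximations, not the real Mahout interface.
interface InCoreVector {

  // A nonzero element handle, in the spirit of Vector.Element.
  interface Element {
    int index();
    double get();
  }

  // Linear-algebra concern.
  InCoreVector plus(InCoreVector other);
  double dot(InCoreVector other);

  // Element data access views and patterns.
  double getQuick(int index);
  Iterator<Element> iterateNonZero();

  // In-core-specific optimizer hints.
  double getLookupCost();
  double getIteratorAdvanceCost();
}
```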
Corollary to that is the simple fact that 95% of Mahout code (and, more importantly, outside code) is something like

for (el : v.iterateNonZero()) { ... do something with element }

which is not parallelizable at all and would require major refactoring of the APIs and of all user code to make it so (see the sketch at the end of this message).

From that follow two arguments:

(1) Doing what you say on the AbstractMatrix or AbstractVector hierarchy is not possible without a "nuclear option" on the API, which would send a ripple effect inside and outside Mahout (my outside code in particular, too).

(2) Even if we invoked the "nuclear option", doing so has no benefit compared to introducing a parallel type hierarchy for distributed matrices, since write-once-run-everywhere works there too.

The idea of write-once-run either in-core or out-of-core is very noble, but in practice it is neither quite feasible (mostly because of component lifecycle and optimization checkpointing concerns), nor does it have significant value. That is, if one can have ssvd and dssvd in 29 lines (assuming the algorithm even has a parallelization strategy), then there's no harm in having two separate things for in-core and out-of-core -- dssvd() and ssvd().
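For concreteness, here is a minimal, compilable version of the loop idiom cited at the top of this message, written against hypothetical stand-in types rather than Mahout's actual classes. The point is structural: the caller pulls elements one at a time through a sequential iterator, so nothing about the loop can be distributed without rewriting the caller.

```java
import java.util.Iterator;

// Hedged sketch: stand-in types, not Mahout's real Vector/Element classes.
final class SequentialTraversalSketch {

  interface Element {
    int index();
    double get();
  }

  interface SparseVector {
    // Mirrors the shape of the iterateNonZero() pattern cited above.
    Iterator<Element> iterateNonZero();
  }

  // Typical caller-driven loop: element-at-a-time, inherently single-threaded.
  static double sumOfSquares(SparseVector v) {
    double sum = 0.0;
    for (Iterator<Element> it = v.iterateNonZero(); it.hasNext(); ) {
      Element el = it.next();
      sum += el.get() * el.get(); // "do something with element"
    }
    return sum;
  }

  private SequentialTraversalSketch() {}
}
```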