mahout-dev mailing list archives

From "Dmitriy Lyubimov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1500) H2O integration
Date Wed, 02 Apr 2014 01:14:15 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957216#comment-13957216
] 

Dmitriy Lyubimov edited comment on MAHOUT-1500 at 4/2/14 1:12 AM:
------------------------------------------------------------------

@Anand, bottom line: the core of AbstractMatrix and Vector is element-wise iterators and direct
element accessors. Lacking closure-based (functional) programming, they don't work for the
distributed stuff.

There are two ways to go with such an approach. One is to declare the core abstractions
unsupported in the distributed implementation, which just proves that AbstractMatrix and Vector
are not good abstractions for that work (why would one need an abstraction if its major, core
contracts are all of a sudden declared optional or deprecated?).

Truth be told, there is some Matrix API that uses FP; the two major things are aggregate()
and assign(). However, this still doesn't get us anywhere, in the sense that we should support
_all_ core contracts, not just assign() and aggregate().
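
To make the contrast concrete, here is a minimal sketch of the two styles of contract. TinyMatrix and AccessStyles are hypothetical stand-ins written for this illustration, not the mahout-math classes; only the method names assign() and aggregate() are taken from the discussion above, and their signatures here are assumptions.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical stand-in for a matrix; NOT the mahout-math Matrix class. */
final class TinyMatrix {
    private final double[][] cells;

    TinyMatrix(double[][] cells) { this.cells = cells; }

    // Direct element accessors: cheap in local memory, but every call would
    // turn into a remote lookup if the cells lived on other machines.
    double get(int row, int col) { return cells[row][col]; }
    void set(int row, int col, double value) { cells[row][col] = value; }

    // Functional contract: the caller hands over a function and the matrix
    // decides where and how to run it.
    TinyMatrix assign(DoubleUnaryOperator f) {
        for (double[] row : cells) {
            for (int j = 0; j < row.length; j++) {
                row[j] = f.applyAsDouble(row[j]);
            }
        }
        return this;
    }

    // Maps every element, then folds the mapped values with the combiner.
    // Returns 0.0 for a matrix with no elements.
    double aggregate(DoubleBinaryOperator combiner, DoubleUnaryOperator mapper) {
        boolean first = true;
        double acc = 0.0;
        for (double[] row : cells) {
            for (double v : row) {
                double mapped = mapper.applyAsDouble(v);
                acc = first ? mapped : combiner.applyAsDouble(acc, mapped);
                first = false;
            }
        }
        return acc;
    }
}

final class AccessStyles {
    // Element-wise style: the caller drives the loop through get().
    static double sumOfSquaresByElement(TinyMatrix m, int rows, int cols) {
        double sum = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                sum += m.get(i, j) * m.get(i, j);
            }
        }
        return sum;
    }

    // Functional style: the caller only supplies the closures.
    static double sumOfSquaresFunctional(TinyMatrix m) {
        return m.aggregate(Double::sum, v -> v * v);
    }

    public static void main(String[] args) {
        TinyMatrix m = new TinyMatrix(new double[][] {{1, 2}, {3, 4}});
        System.out.println(sumOfSquaresByElement(m, 2, 2));
        System.out.println(sumOfSquaresFunctional(m));
    }
}
```

Only the second style leaves the implementation free to push the work to wherever the data is, which is the property a distributed backend would need from every core contract and not just these two.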

The other way of going about it is to heavily refactor the core abstractions in favor of
functional support, while deprecating or eliminating direct access. I call this the "nuclear
option", because it sends ripple effects not only through Mahout but through any 3rd-party code
that uses mahout-math (mine specifically). It will force people to reconsider using Mahout
because of stability issues in areas where it was promised to be stable.

Extending the DistributedRowMatrix API... I am kind of dubious about that as well, since it is
also unusable without a major FP infusion, and is frankly kind of ancient.

More likely, a completely new FP-laced distributed matrix representation is desired. SparkBindings
went down that path and created an FP-laced DRM API. But this is entirely a Scala-side abstraction,
with Scala function literals etc. So if you are looking to create a Java distributed matrix
abstraction, this is not going to be useful at all either.

So more likely, you need a completely new FP-oriented Java API interface, something like X2OMatrix.java.
This will fragment the project even further but, all marketing fluff excluded, that's the only
realistic option I see that might work.
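
As a rough idea of what an FP-oriented Java interface could look like, here is a minimal sketch. Everything in it (FunctionalDistributedMatrix, LocalBlockMatrix, mapBlocks, aggregateBlocks) is hypothetical and invented for illustration; it is not an existing Mahout or H2O API, and the "partitions" are plain in-memory row blocks standing in for data that would really live on different nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.function.UnaryOperator;

/** Hypothetical FP-only contract: no get(row, col) and no element iterators. */
interface FunctionalDistributedMatrix {
    // Transform every block independently; a real backend would run this
    // wherever the block is stored.
    FunctionalDistributedMatrix mapBlocks(UnaryOperator<double[][]> blockFn);

    // Reduce every block to a partial result, then combine the partials.
    <R> R aggregateBlocks(Function<double[][], R> mapper, BinaryOperator<R> combiner);
}

/** Toy implementation: the "partitions" are just in-memory row blocks. */
final class LocalBlockMatrix implements FunctionalDistributedMatrix {
    private final List<double[][]> blocks;

    LocalBlockMatrix(List<double[][]> blocks) { this.blocks = blocks; }

    @Override
    public FunctionalDistributedMatrix mapBlocks(UnaryOperator<double[][]> blockFn) {
        List<double[][]> out = new ArrayList<>();
        for (double[][] block : blocks) {
            out.add(blockFn.apply(block));
        }
        return new LocalBlockMatrix(out);
    }

    @Override
    public <R> R aggregateBlocks(Function<double[][], R> mapper, BinaryOperator<R> combiner) {
        return blocks.stream()
                .map(mapper)
                .reduce(combiner)
                .orElseThrow(() -> new IllegalStateException("matrix has no blocks"));
    }
}

final class FunctionalMatrixDemo {
    // Sum of all elements in one block.
    static double blockSum(double[][] block) {
        double sum = 0.0;
        for (double[] row : block) {
            for (double v : row) {
                sum += v;
            }
        }
        return sum;
    }

    // Returns a scaled copy of the block and leaves the input untouched.
    static double[][] scale(double[][] block, double factor) {
        double[][] out = new double[block.length][];
        for (int i = 0; i < block.length; i++) {
            out[i] = new double[block[i].length];
            for (int j = 0; j < block[i].length; j++) {
                out[i][j] = block[i][j] * factor;
            }
        }
        return out;
    }

    // Scale the whole matrix, then total it, using only closures.
    static double scaledTotal(FunctionalDistributedMatrix m, double factor) {
        return m.mapBlocks(b -> scale(b, factor))
                .<Double>aggregateBlocks(FunctionalMatrixDemo::blockSum, Double::sum);
    }

    public static void main(String[] args) {
        FunctionalDistributedMatrix m = new LocalBlockMatrix(Arrays.asList(
                new double[][] {{1, 2}},
                new double[][] {{3, 4}, {5, 6}}));
        System.out.println(scaledTotal(m, 2.0));
    }
}
```

The point of the sketch is what the interface leaves out: with no per-element accessor in the contract, nothing has to be declared optional or unsupported by a distributed implementation.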

I would also question (kinda) the wisdom of a standalone distributed vector abstraction. On
the Hadoop side and the Spark side this abstraction is completely bypassed (it is assumed that a
real vector will always fit into a single machine's memory). In situations where a vector might be
formed as the result of a distributed operation (e.g. A %*% x), the result is simply a distributed
single-column matrix, from which the column can always be collected in the front end via the
collection/slicing API.
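
The A %*% x case can be sketched as follows. MatVecSketch and its methods are hypothetical names for this illustration only; the row blocks are in-memory arrays standing in for distributed partitions, and x is assumed to fit in memory as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MatVecSketch {
    // Each row block of A computes its own slice of A*x. The in-memory
    // vector x is simply handed to every block ("broadcast"). The result
    // keeps the same blocking as A, with one column per block.
    static List<double[][]> timesVector(List<double[][]> rowBlocksOfA, double[] x) {
        List<double[][]> result = new ArrayList<>();
        for (double[][] block : rowBlocksOfA) {
            double[][] column = new double[block.length][1];
            for (int i = 0; i < block.length; i++) {
                double dot = 0.0;
                for (int j = 0; j < x.length; j++) {
                    dot += block[i][j] * x[j];
                }
                column[i][0] = dot;
            }
            result.add(column);
        }
        return result;
    }

    // "Collect" step: pull the single column of every block into one
    // ordinary in-memory array on the front end.
    static double[] collectColumn(List<double[][]> singleColumnBlocks) {
        int rows = 0;
        for (double[][] block : singleColumnBlocks) {
            rows += block.length;
        }
        double[] out = new double[rows];
        int next = 0;
        for (double[][] block : singleColumnBlocks) {
            for (double[] row : block) {
                out[next++] = row[0];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[][]> a = Arrays.asList(
                new double[][] {{1, 2}},
                new double[][] {{3, 4}, {5, 6}});
        double[] ax = collectColumn(timesVector(a, new double[] {1, 1}));
        System.out.println(Arrays.toString(ax));
    }
}
```

No separate distributed vector type appears anywhere: the product is a single-column matrix with the same blocking as A, and the vector only materializes when it is collected.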

 


 

> H2O integration
> ---------------
>
>                 Key: MAHOUT-1500
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Anand Avati
>             Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector, and more as we make progress.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
