mahout-dev mailing list archives

From "Dmitriy Lyubimov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1500) H2O integration
Date Tue, 01 Apr 2014 18:01:23 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956824#comment-13956824 ]

Dmitriy Lyubimov edited comment on MAHOUT-1500 at 4/1/14 6:01 PM:
------------------------------------------------------------------

bq. Now it seems to me (with my limited exploring of Mahout) that it might actually be viable
to provide a "hadoop alternative" in the form of an alternate implementation of DistributedRowMatrix
(instead of AbstractMatrix) 

Yes, that's what I meant. On the Scala side, this is done by introducing mix-ins DrmLike, RLikeOps,
RLikeDrmOps, RLikeVectorOps, etc. On the Java side, working with mix-ins (functionality-filled
traits) is of course not easy, but the important point is that it should be an alternative
hierarchy with an identical intersection of optimized linalg operators (operator-oriented
semantics in linear algebra).

I.e. the assumption is that to the end user (developer) it is more important that the notation
{code}
a dot b
{code} 

means exactly the same regardless of whether a and b are in-core or distributed; it matters
significantly less whether a and b descend from different hierarchies (e.g. Matrix or DRM),
as long as the operator dot(A,B) is defined for all possible type combinations (sparse, dense,
distributed).
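
A minimal Scala sketch of that idea, under loud assumptions: InCoreVec, DistVec, Dot and DotSyntax
are hypothetical names invented for illustration and are not Mahout classes, and the
"distributed" type just holds local partitions. It only shows how one dot operator can be
defined per type combination while the two operand types share no common ancestor.

```scala
// Hypothetical, simplified stand-ins; these are NOT Mahout's actual classes.

// In-core hierarchy: all elements live in local memory.
final case class InCoreVec(values: Vector[Double])

// Distributed hierarchy, unrelated to InCoreVec: elements are split into
// partitions. Local collections stand in for data spread across workers.
final case class DistVec(partitions: Seq[Vector[Double]])

// The logical operator, parameterized by both operand types.
trait Dot[A, B] {
  def apply(a: A, b: B): Double
}

object Dot {
  private def instance[A, B](f: (A, B) => Double): Dot[A, B] =
    new Dot[A, B] { def apply(a: A, b: B): Double = f(a, b) }

  private def dotLocal(x: Vector[Double], y: Vector[Double]): Double = {
    require(x.length == y.length, "vectors must have the same length")
    x.zip(y).map { case (p, q) => p * q }.sum
  }

  // A real backend would combine per-partition partial sums remotely;
  // this sketch simply gathers the partitions into one local vector.
  private def gather(d: DistVec): Vector[Double] = d.partitions.flatten.toVector

  // One instance per type combination.
  implicit val inCoreInCore: Dot[InCoreVec, InCoreVec] =
    instance[InCoreVec, InCoreVec]((a, b) => dotLocal(a.values, b.values))

  implicit val inCoreDist: Dot[InCoreVec, DistVec] =
    instance[InCoreVec, DistVec]((a, b) => dotLocal(a.values, gather(b)))

  implicit val distInCore: Dot[DistVec, InCoreVec] =
    instance[DistVec, InCoreVec]((a, b) => dotLocal(gather(a), b.values))

  implicit val distDist: Dot[DistVec, DistVec] =
    instance[DistVec, DistVec]((a, b) => dotLocal(gather(a), gather(b)))
}

// Enables the infix notation `a dot b` for any pair of types with a Dot instance.
object DotSyntax {
  implicit class DotOps[A](val self: A) {
    def dot[B](other: B)(implicit op: Dot[A, B]): Double = op(self, other)
  }
}
```

With DotSyntax imported, a dot b compiles for each of the four combinations above and is
rejected at compile time for a pair of types that has no Dot instance.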

bq. and AbstractJob (by internally using h2o's Frame/Vec and MRTask2 APIs), and thereby allow
for a runtime choice of Hadoop vs H2O. 

I care significantly less about the Job API and Hadoop MR in particular. It is my belief that
they are non-essential to the math user and therefore should be avoided altogether (and that
notion is eliminated in the Spark Bindings).

bq. This seems like a reasonable first step?
Yes, with the caveat that logical mix-ins for distributed and in-core already exist in the Scala
and Spark Bindings. Like I said, mapping this logical layer onto a particular physical
layer seems to me an infinitely better architecture than creating yet another logical
layer specific to a particular backend. However, I see that it would be hard to converge on that,
or at least I don't see how. I will extract an architecture slide from my talk and post a
link to illustrate the idea a bit later.
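
To make the "one logical layer, several physical layers" point concrete, here is a rough Scala
sketch. Every name in it (LogicalMatrix, Source, Transpose, Times, Backend, NaiveBackend,
FusingBackend, Dense) is hypothetical and invented for this illustration; it is not the Spark
Bindings API, and both "backends" run locally. The point is only that the user-facing
expression is written once and each backend decides how to execute it.

```scala
// Hypothetical names throughout; this is NOT the Mahout Spark Bindings API.

// Shared dense helpers used by both toy backends.
object Dense {
  type Rows = Vector[Vector[Double]]

  def times(a: Rows, b: Rows): Rows = {
    val bt = b.transpose
    a.map(row => bt.map(col => row.zip(col).map { case (x, y) => x * y }.sum))
  }

  // A' * A computed directly from the rows of A, with no explicit transpose.
  def gram(a: Rows): Rows = {
    val n = a.headOption.map(_.length).getOrElse(0)
    Vector.tabulate(n, n)((i, j) => a.map(r => r(i) * r(j)).sum)
  }
}

// Logical layer: describes WHAT to compute, with no reference to an engine.
sealed trait LogicalMatrix {
  def t: LogicalMatrix = Transpose(this)
  def %*%(that: LogicalMatrix): LogicalMatrix = Times(this, that)
}
final case class Source(rows: Dense.Rows) extends LogicalMatrix
final case class Transpose(a: LogicalMatrix) extends LogicalMatrix
final case class Times(a: LogicalMatrix, b: LogicalMatrix) extends LogicalMatrix

// Physical layer: each engine only has to interpret the logical plan.
trait Backend {
  def name: String
  def eval(plan: LogicalMatrix): Dense.Rows
}

// Evaluates every logical node literally.
object NaiveBackend extends Backend {
  val name = "naive"
  def eval(plan: LogicalMatrix): Dense.Rows = plan match {
    case Source(rows) => rows
    case Transpose(a) => eval(a).transpose
    case Times(a, b)  => Dense.times(eval(a), eval(b))
  }
}

// Same logical plan, but recognizes A' %*% A and uses a specialized operator.
object FusingBackend extends Backend {
  val name = "fusing"
  def eval(plan: LogicalMatrix): Dense.Rows = plan match {
    case Times(Transpose(a), b) if a == b => Dense.gram(eval(a))
    case Source(rows)                     => rows
    case Transpose(a)                     => eval(a).transpose
    case Times(a, b)                      => Dense.times(eval(a), eval(b))
  }
}
```

An expression such as a.t %*% a is built once against the logical layer; passing it to
NaiveBackend.eval or FusingBackend.eval changes only how it is executed, not what it means.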


> H2O integration
> ---------------
>
>                 Key: MAHOUT-1500
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Anand Avati
>             Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector, and more as we make progress.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
