mahout-dev mailing list archives

From Frank Scholten <scholten....@gmail.com>
Subject Re: [jira] [Commented] (MAHOUT-1500) H2O integration
Date Tue, 01 Apr 2014 11:16:54 GMT
I suggest creating a separate mahout-h2o module that shows a simple end-to-end example, including
vectorizing. I had a look at the h2o API; it looks interesting, and I am curious to see how
to vectorize data from different sources. We could start with an existing example, such as
clustering Reuters.

I would not suggest immediately trying to extend existing Mahout APIs. I agree with Dmitriy
that we shouldn't mix distributed and local code. After creating a few examples we can see
where code can be reused and where the boundaries are. It also gives everyone a feel for the
h2o API. Then we can extract common code.

I also like Anand's idea of creating an h2o alternative to a Hadoop job. I would like to see
this implemented as a Java bean with a separate CLI driver class, so it is easy to use
from Java. Current Mahout jobs have to be called via main methods with String arrays. See
lucene2seq as an example of the bean config idea.
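
The bean-plus-driver idea could look roughly like the sketch below. All names in it (ExampleJob, ExampleJobDriver, the option flags) are hypothetical and are not existing Mahout or h2o classes, and the job body is only a placeholder. The point is that Java callers configure the job through setters, while String[] parsing lives only in the driver.

```java
// Hypothetical job bean: configured through plain setters, no String[] involved.
class ExampleJob {
    private String input;
    private String output;
    private int numClusters = 10; // assumed default, for illustration only

    public String getInput() { return input; }
    public void setInput(String input) { this.input = input; }

    public String getOutput() { return output; }
    public void setOutput(String output) { this.output = output; }

    public int getNumClusters() { return numClusters; }
    public void setNumClusters(int numClusters) { this.numClusters = numClusters; }

    // Placeholder for the real work; it only validates and describes the run.
    public String run() {
        if (input == null || output == null) {
            throw new IllegalStateException("input and output must be set");
        }
        return "clustering " + input + " into " + numClusters + " clusters -> " + output;
    }
}

// Hypothetical CLI driver: the only place that deals with String[] arguments.
class ExampleJobDriver {
    static ExampleJob fromArgs(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("options must come as '--name value' pairs");
        }
        ExampleJob job = new ExampleJob();
        for (int i = 0; i < args.length; i += 2) {
            String value = args[i + 1];
            switch (args[i]) {
                case "--input": job.setInput(value); break;
                case "--output": job.setOutput(value); break;
                case "--numClusters": job.setNumClusters(Integer.parseInt(value)); break;
                default: throw new IllegalArgumentException("unknown option: " + args[i]);
            }
        }
        return job;
    }

    public static void main(String[] args) {
        System.out.println(fromArgs(args).run());
    }
}
```

From Java code the bean is used directly (new ExampleJob(), setters, run()); from the command line the same bean is built by the driver's main method.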

Frank

On Apr 1, 2014, at 12:09, Ted Dunning <ted.dunning@gmail.com> wrote:

> I would rather see a matrix that looks local but acts global so that coders can produce very simple code that is still parallelized.
> 
> Sent from my iPhone
> 
>> On Apr 1, 2014, at 11:09, "Anand Avati (JIRA)" <jira@apache.org> wrote:
>> 
>> 
>>   [ https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956283#comment-13956283 ]
>> 
>> Anand Avati commented on MAHOUT-1500:
>> -------------------------------------
>> 
>> Thanks for your feedback, Dmitry.
>> 
>> Now it seems to me (with my limited exploring of Mahout) that it might actually be
>> viable to provide a "hadoop alternative" in the form of an alternate implementation of DistributedRowMatrix
>> (instead of AbstractMatrix) and AbstractJob (by internally using h2o's Frame/Vec and MRTask2
>> APIs), and thereby allow for a runtime choice of Hadoop vs H2O. This seems like a reasonable
>> first step?
>> 
>>> H2O integration
>>> ---------------
>>> 
>>>               Key: MAHOUT-1500
>>>               URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>>>           Project: Mahout
>>>        Issue Type: Improvement
>>>          Reporter: Anand Avati
>>>           Fix For: 1.0
>>> 
>>> 
>>> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high performance
>>> computational abilities.
>>> Start with providing implementations of AbstractMatrix and AbstractVector, and
>>> more as we make progress.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
