flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Henry Saputra <henry.sapu...@gmail.com>
Subject Re: Kicking off the Machine Learning Library
Date Fri, 09 Jan 2015 01:09:37 GMT
Hi Till,

Yes, that is pretty much how they do it.
The trick is to get access to shared data between Spark realm and H2O realm.
Prev gen they use Tachyon but in the latest stint they use run H2O in
the Spark executor to use the heap shared memory to do the trick.

I am trying to hook us up with H2O guys, lets hope it pays off =)

- Henry

On Thu, Jan 8, 2015 at 3:48 AM, Till Rohrmann <trohrmann@apache.org> wrote:
> Great Henry,
>
> Sparkling Water looks really interesting. We would probably have to take a
> similar approach to make Flink interact with H2O.
>
> I briefly looked into the code and that's how they do it (or at least how I
> understood it ;-). Correct me if I got it wrong:
>
> They first start up a Spark cluster and start from within a RDD an H2O
> worker. Afterwards, they have on each node where a Spark executor runs also
> an H2O worker running. Consequently, they have on each node access to the
> distributed key value storage of H2O. I really like how elegantly they set
> up H2O from within Spark :-) Flink should be capable of doing the same. We
> only have to leave some memory for H2O.
>
> Once this is done, they only have to translate a RDD into a DataFrame and
> vice versa:
>
> RDD => DataFrame:
> This operation is realized as a mapPartition operation of Spark which is
> executed as an action. Currently, tuple elements and a SchemaRDD are
> supported. The mapPartition operation takes the tuples and groups fields
> with the same positional index together to build H2O's vector chunks which
> are then stored in the distributed key value store (DKV). We can do the
> same.
>
> DataFrame => RDD:
> Here they implemented a H2ORDD which reads from the DKV the corresponding
> vector chunks of the partition and constructs for each row an instance. We
> should be able to convert a DataFrame to a DataSet by implementing a
> H2ODKVInputFormat which basically does the same.
>
> I think that the implementation details might not be trivial but judging
> from the size of sparkling-water (1300 lines of Scala code) it should
> definitely be feasible. And as Henry already mentioned by having the H2O
> integration, we get a lot of ML algorithms for free. We could also talk to
> 0xdata to see if they are interested to help us with the effort.
>
> Greets,
>
> Till
>
>
> On Wed, Jan 7, 2015 at 7:08 PM, Henry Saputra <henry.saputra@gmail.com>
> wrote:
>
>> 0xdata (now is called H2O) is developing integration with Spark with
>> the project called Sparkling Water [1].
>> It creates new RDD that could connect to H2O cluster to pass the
>> higher order function to execute in the ML flow.
>>
>> The easiest way to use H2O is with R binding [2][3] but I think we
>> would want to interact with H2O as via the REST APIs [4].
>>
>> - Henry
>>
>> [1] https://github.com/h2oai/sparkling-water
>> [2] http://www.slideshare.net/anqifu1/big-data-science-with-h2o-in-r
>> [3] http://docs.h2o.ai/Ruser/rtutorial.html
>> [4] http://docs.h2o.ai/developuser/rest.html
>>
>> On Wed, Jan 7, 2015 at 3:10 AM, Stephan Ewen <sewen@apache.org> wrote:
>> > Thanks Henry!
>> >
>> > Do you know of a good source that gives pointers or examples how to
>> > interact with H2O ?
>> >
>> > Stephan
>> >
>> >
>> > On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <trohrmann@apache.org>
>> wrote:
>> >
>> >> The idea to work with H2O sounds really interesting.
>> >>
>> >> In terms of the Mahout DSL this would mean that we have to translate a
>> >> Flink dataset into H2O's basic abstraction of distributed data and vice
>> >> versa. Everything other than writing to disk with one system and reading
>> >> from there with the other is probably non-trivial and hard to realize.
>> >> On Jan 4, 2015 9:18 AM, "Henry Saputra" <henry.saputra@gmail.com>
>> wrote:
>> >>
>> >> > Happy new year all!
>> >> >
>> >> > Like the idea to add ML module with Flink.
>> >> >
>> >> > As I have mentioned to Kostas, Stephan, and Robert before, I would
>> >> > love to see if we could work with H20 project [1], and it seemed like
>> >> > the community has added support for it for Apache Mahout backend
>> >> > binding [2].
>> >> >
>> >> > So we might get some additional scale ML algos like deep learning.
>> >> >
>> >> > Definitely would love to help with this initiative =)
>> >> >
>> >> > - Henry
>> >> >
>> >> > [1] https://github.com/h2oai/h2o-dev
>> >> > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
>> >> >
>> >> > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <sewen@apache.org>
>> wrote:
>> >> > > Hi everyone!
>> >> > >
>> >> > > Happy new year, first of all and I hope you had a nice
>> end-of-the-year
>> >> > > season.
>> >> > >
>> >> > > I thought that it is a good time now to officially kick off the
>> >> creation
>> >> > of
>> >> > > a library of machine learning algorithms. There are a lot of
>> individual
>> >> > > artifacts and algorithms floating around which we should
>> consolidate.
>> >> > >
>> >> > > The machine-learning library in Flink would stand on two legs:
>> >> > >
>> >> > >  - A collection of efficient implementations for common problems
and
>> >> > > algorithms, e.g., Regression (logistic), clustering (k-Means,
>> Canopy),
>> >> > > Matrix Factorization (ALS), ...
>> >> > >
>> >> > >  - An adapter to the linear algebra DSL in Apache Mahout.
>> >> > >
>> >> > > In the long run, it would be the goal to be able to mix and match
>> code
>> >> > from
>> >> > > both parts.
>> >> > > The linear algebra DSL is very convenient when it comes to quickly
>> >> > > composing an algorithm, or some custom pre- and post-processing
>> steps.
>> >> > > For some complex algorithms, however, a low level system specific
>> >> > > implementation is necessary to make the algorithm efficient.
>> >> > > Being able to call the tailored algorithms from the DSL would
>> combine
>> >> the
>> >> > > benefits.
>> >> > >
>> >> > >
>> >> > > As a concrete initial step, I suggest to do the following:
>> >> > >
>> >> > > 1) We create a dedicated maven sub-project for that ML library
>> >> > > (flink-lib-ml). The project gets two sub-projects, one for the
>> >> collection
>> >> > > of specialized algorithms, one for the Mahout DSL
>> >> > >
>> >> > > 2) We add the code for the existing specialized algorithms. As
>> followup
>> >> > > work, we need to consolidate data types between those algorithms,
to
>> >> > ensure
>> >> > > that they can easily be combined/chained.
>> >> > >
>> >> > > 3) The code for the Flink bindings to the Mahout DSL will actually
>> >> reside
>> >> > > in the Mahout project, which we need to add as a dependency to
>> >> > flink-lib-ml.
>> >> > >
>> >> > > 4) We add some examples of Mahout DSL algorithms, and a template
>> how to
>> >> > use
>> >> > > them within Flink programs.
>> >> > >
>> >> > > 5) Create a good introductory readme.md, outlining this structure.
>> The
>> >> > > readme can also track the implemented algorithms and the ones
we
>> put on
>> >> > the
>> >> > > roadmap.
>> >> > >
>> >> > >
>> >> > > Comments welcome :-)
>> >> > >
>> >> > >
>> >> > > Greetings,
>> >> > > Stephan
>> >> >
>> >>
>>

Mime
View raw message