flink-dev mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Kicking off the Machine Learning Library
Date Sat, 03 Jan 2015 19:18:06 GMT
+1 for the initial steps, which I can implement.

On Sat, Jan 3, 2015 at 8:15 PM, Till Rohrmann <trohrmann@apache.org> wrote:

> Hi,
> Happy new year, everyone. I hope you all had some relaxing holidays.
> I really like the idea of having a machine learning library because it
> allows users to quickly solve problems without having to dive too deep into
> the system. Moreover, it is a good way to show what the system is capable
> of in terms of expressibility and programming paradigms.
> We already have more or less optimised versions of several ML
> algorithms implemented with Flink. I'm aware of the following
> implementations: PageRank, ALS, KMeans, ConnectedComponents. I think that
> these algorithms constitute a good foundation for the ML library.
> I like the idea of having optimised algorithms that can be mixed with
> Mahout DSL code. As far as I can tell, the interoperation should not be too
> difficult if the "future" Flink backend is used to execute the Mahout DSL
> program. Internally, the Mahout DSL performs its operations on a row-wise
> partitioned matrix, which is represented as a DataSet[(Key, Vector)].
> Providing some wrapper functions to transform different matrix
> representations into the row-wise representation should be the first step.
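The row-wise representation described above can be sketched without any framework; here plain Python lists stand in for a distributed DataSet[(Key, Vector)], and the helper names are purely illustrative:

```python
# Sketch: convert a dense matrix to the row-wise (key, vector)
# representation described above. A plain list stands in for a
# distributed DataSet[(Key, Vector)]; all names are illustrative.

def to_row_wise(matrix):
    """Pair each row with its row index: the row-wise representation."""
    return [(i, list(row)) for i, row in enumerate(matrix)]

def from_row_wise(pairs):
    """Reassemble a dense matrix from (key, vector) pairs, ordered by key."""
    return [vec for _, vec in sorted(pairs)]

dense = [[1.0, 2.0], [3.0, 4.0]]
rows = to_row_wise(dense)
assert rows == [(0, [1.0, 2.0]), (1, [3.0, 4.0])]
assert from_row_wise(rows) == dense
```

A real wrapper would do the same key-tagging per partition inside a Flink map operator rather than on a local list.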
> Another idea could be to investigate to what extent Flink can interact
> with the Parameter Server, and which algorithms could be adapted to benefit
> from such a system.
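The parameter-server interaction mentioned above boils down to a pull/push cycle between workers and a shared model store. A minimal sketch, with a local list standing in for the distributed store; the API shown is hypothetical, not any concrete system's:

```python
# Sketch of the parameter-server pattern alluded to above. A local list
# stands in for the distributed key-value store; the push/pull API is
# hypothetical, not any concrete system's interface.

class ParameterServer:
    def __init__(self, dim):
        self.weights = [0.0] * dim  # the shared model

    def pull(self):
        """Workers fetch the current model before computing updates."""
        return list(self.weights)

    def push(self, gradient, lr=0.1):
        """Workers push gradients; the server applies them to the model."""
        self.weights = [w - lr * g for w, g in zip(self.weights, gradient)]

ps = ParameterServer(dim=2)
model = ps.pull()        # worker reads the shared model
ps.push([1.0, -2.0])     # worker sends back a gradient
assert ps.pull() == [-0.1, 0.2]
```

Adapting an algorithm to this pattern mostly means splitting it into a pull step, a local compute step, and a push step.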
> Greetings,
> Till
> On Fri, Jan 2, 2015 at 3:46 PM, Stephan Ewen <sewen@apache.org> wrote:
>> Hi everyone!
>> Happy new year, first of all, and I hope you had a nice end-of-year
>> season.
>> I thought that now is a good time to officially kick off the creation of
>> a library of machine learning algorithms. There are a lot of individual
>> artifacts and algorithms floating around which we should consolidate.
>> The machine-learning library in Flink would stand on two legs:
>>  - A collection of efficient implementations for common problems and
>> algorithms, e.g., regression (logistic), clustering (k-means, Canopy),
>> matrix factorization (ALS), ...
>>  - An adapter to the linear algebra DSL in Apache Mahout.
>> In the long run, the goal would be to be able to mix and match code
>> from both parts.
>> The linear algebra DSL is very convenient when it comes to quickly
>> composing an algorithm, or some custom pre- and post-processing steps.
>> For some complex algorithms, however, a low-level, system-specific
>> implementation is necessary to make the algorithm efficient.
>> Being able to call the tailored algorithms from the DSL would combine the
>> benefits.
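The mix-and-match idea can be illustrated abstractly: high-level composition (standing in for DSL code) wrapped around a call into a tailored routine. Every name here is an illustrative placeholder, not a proposed API:

```python
# Sketch of the mix-and-match goal: DSL-style composition around a call
# into a specialized, hand-optimized routine. All names are illustrative
# placeholders; neither API exists yet.

def normalize(rows):
    """'DSL-style' pre-processing: scale each row to unit max-norm."""
    return [[x / max(map(abs, r)) for x in r] for r in rows]

def optimized_kmeans(rows, k):
    """Placeholder for a tailored, system-specific implementation."""
    # A real version would be a low-level Flink program; here we simply
    # return the first k rows as 'centroids' to keep the sketch runnable.
    return rows[:k]

data = [[2.0, 4.0], [1.0, 3.0], [5.0, 5.0]]
centroids = optimized_kmeans(normalize(data), k=2)
assert centroids == [[0.5, 1.0], [1.0 / 3.0, 1.0]]
```

The benefit is exactly this shape: cheap-to-write glue on the outside, an efficient kernel on the inside.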
>> As a concrete initial step, I suggest doing the following:
>> 1) We create a dedicated maven sub-project for the ML library
>> (flink-lib-ml). The project gets two sub-projects: one for the collection
>> of specialized algorithms, one for the Mahout DSL adapter.
>> 2) We add the code for the existing specialized algorithms. As follow-up
>> work, we need to consolidate data types between those algorithms to ensure
>> that they can easily be combined/chained.
>> 3) The code for the Flink bindings to the Mahout DSL will actually reside
>> in the Mahout project, which we need to add as a dependency to
>> flink-lib-ml.
>> 4) We add some examples of Mahout DSL algorithms, and a template for how
>> to use them within Flink programs.
>> 5) We create a good introductory readme.md outlining this structure. The
>> readme can also track the implemented algorithms and the ones we put on
>> the roadmap.
>> Comments welcome :-)
>> Greetings,
>> Stephan
