flink-dev mailing list archives

From Katherin Eri <katherinm...@gmail.com>
Subject Re: New Flink team member - Kate Eri.
Date Mon, 06 Feb 2017 10:26:24 GMT
Hello, guys.

Theodore, last week I started the review of the PR:
https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.



During this review I asked myself: why do we need to implement such a
very popular algorithm as *word2vec one more time*, when a Java
implementation is already available in the deeplearning4j.org
<https://deeplearning4j.org/word2vec> library (DL4J, Apache 2 licence)?
This library is actively promoted, there is hype around it in the ML
community, and it has been integrated with Apache Spark to provide scalable
deep learning computations.


*That's why I wondered: could we also integrate this library with Flink?*

1) Personally, I think that supporting the deployment of *deep learning (DL)
algorithms/models in Flink* is a promising and attractive feature, because:

    a) Over the last two years DL has proved its effectiveness, and these
algorithms are used in many applications. For example, *Spotify* uses
DL-based algorithms to extract music content for its recommendations; see
"Recommending music on Spotify with deep learning", August 5, 2014
<http://benanne.github.io/2014/08/05/spotify-cnns.html>. Developers currently
have to scale DL up manually, which takes a lot of work, so platforms like
Flink should support the deployment of these models.

    b) The range of deep learning use cases is presented here:
<https://deeplearning4j.org/use_cases>. Many of these scenarios could be
supported on Flink.


2) But DL raises questions such as:

    a) how to scale computations across machines;

    b) how to perform these computations on both CPU and GPU. A GPU is needed to
train big DL models; otherwise the learning process can converge very
slowly.


3) I have looked at the DL4J library, which already has rich support for
many attractive DL models, such as recurrent networks and LSTMs, convolutional
networks (CNN), restricted Boltzmann machines (RBM) and others. So we won’t
need to implement them ourselves; we only need to provide the ability to
execute these models on a Flink cluster, much as was done for the
integration with Apache Spark.


Because of all this, I propose:

1)    To create a new ticket in Flink’s JIRA for the integration of Flink with
DL4J, and to decide on which side this integration should be implemented.

2)    To support GPU resources natively in Flink and allow computations on
them, as described in this publication:
https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus



*Regarding the original issue, Implement Word2Vec
<https://issues.apache.org/jira/browse/FLINK-2094> in Flink,* I have
investigated its implementation in DL4J and the integration of DL4J
with Apache Spark, and have several observations:

It seems that building our own implementation of word2vec in
Flink is not such a bad solution, because DL4J itself was forced to reimplement
its original word2vec on top of Spark. I have checked the integration of DL4J
with Spark and found that it is too strongly coupled to the Spark API, so
it is impossible to simply take some DL4J API and reuse it; instead we would
need to implement an independent integration for Flink.

*That’s why I think we should simply finish the implementation of the current
PR **independently** of the integration with DL4J.*



Could you please share your opinion on these questions and points?



On Mon, Feb 6, 2017 at 12:51, Katherin Eri <katherinmail@gmail.com> wrote:

> Sorry, guys, I need to finish this letter first.
>   The full version will come shortly.
>
> On Mon, Feb 6, 2017 at 12:49, Katherin Eri <katherinmail@gmail.com> wrote:
>
> Hello, guys.
> Theodore, last week I started the review of the PR:
> https://github.com/apache/flink/pull/2735 related to *word2Vec for Flink*.
>
> During this review I asked myself: why do we need to implement such a
> very popular algorithm as *word2vec one more time*, when a Java
> implementation is already available in the deeplearning4j.org
> <https://deeplearning4j.org/word2vec> library (DL4J, Apache 2 licence)?
> This library is actively promoted, there is hype around it in the ML
> community, and it has been integrated with Apache Spark to provide scalable
> deep learning computations.
> That's why I wondered: could we also integrate this library with Flink?
> 1) Personally, I think that supporting the deployment of deep learning
> algorithms/models in Flink is a promising and attractive feature, because:
>     a) Over the last two years deep learning has proved its effectiveness,
> and these algorithms are used in many applications. For example, *Spotify*
> uses DL-based algorithms to extract music content for its recommendations;
> see "Recommending music on Spotify with deep learning", August 5, 2014
> <http://benanne.github.io/2014/08/05/spotify-cnns.html>. Doing this in a
> natively scalable way is very attractive.
>
>
> I have investigated the integration of DL4J with Apache
> Spark, and have several observations:
>
> 1) It seems that building our own implementation of word2vec is
> not such a bad solution, because the integration of DL4J with Spark is too
> strongly coupled to the Spark API, and it would take time on the DL4J side
> to adapt this integration to Flink. I had also expected that we would be
> able to simply call some API, but that is not the case.
> 2)
>
> https://deeplearning4j.org/use_cases
> https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/
>
>
> On Thu, Jan 19, 2017 at 13:29, Till Rohrmann <trohrmann@apache.org> wrote:
>
> Hi Katherin,
>
> welcome to the Flink community. Always great to see new people joining the
> community :-)
>
> Cheers,
> Till
>
> On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko <katherinmail@gmail.com>
> wrote:
>
> > ok, I've got it.
> > I will take a look at  https://github.com/apache/flink/pull/2735.
> >
> > On Tue, Jan 17, 2017 at 14:36, Theodore Vasiloudis <
> > theodoros.vasiloudis@gmail.com> wrote:
> >
> > > Hello Katherin,
> > >
> > > Welcome to the Flink community!
> > >
> > > The ML component definitely needs a lot of work, you are correct; we are
> > > facing similar problems to CEP, which we'll hopefully resolve with the
> > > restructuring Stephan has mentioned in that thread.
> > >
> > > If you'd like to help out with PRs, we have many open; one I have started
> > > reviewing but got side-tracked from is the Word2Vec one [1].
> > >
> > > Best,
> > > Theodore
> > >
> > > [1] https://github.com/apache/flink/pull/2735
> > >
> > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske <fhueske@gmail.com> wrote:
> > >
> > > > Hi Katherin,
> > > >
> > > > welcome to the Flink community!
> > > > Help with reviewing PRs is always very welcome and a great way to
> > > > contribute.
> > > >
> > > > Best, Fabian
> > > >
> > > >
> > > >
> > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko <katherinmail@gmail.com>:
> > > >
> > > > > Thank you, Timo.
> > > > > I have started the analysis of the topic.
> > > > > And if necessary, I will try to review other pull requests as well :)
> > > > >
> > > > >
> > > > > On Tue, Jan 17, 2017 at 13:09, Timo Walther <twalthr@apache.org> wrote:
> > > > >
> > > > > > Hi Katherin,
> > > > > >
> > > > > > great to hear that you would like to contribute! Welcome!
> > > > > >
> > > > > > I gave you contributor permissions. You can now assign issues to
> > > > > > yourself. I assigned FLINK-1750 to you.
> > > > > > Right now there are many open ML pull requests, you are very welcome
> > > > > > to review the code of others, too.
> > > > > >
> > > > > > Timo
> > > > > >
> > > > > >
> > > > > > On 17/01/17 at 10:39, Katherin Sotenko wrote:
> > > > > > > Hello, All!
> > > > > > >
> > > > > > > I'm Kate Eri, a Java developer with 6 years of enterprise
> > > > > > > experience; I also have some experience with Scala (half a year).
> > > > > > >
> > > > > > > Over the last 2 years I have participated in several Big Data
> > > > > > > projects related to Machine Learning (time series analysis,
> > > > > > > recommender systems, social networking) and ETL. I have experience
> > > > > > > with Hadoop, Apache Spark and Hive.
> > > > > > >
> > > > > > > I’m fond of the ML topic, and I see that the Flink project requires
> > > > > > > some work in this area, so I would like to join Flink and ask you
> > > > > > > to assign the ticket
> > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > > > > > > to me.
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
