flink-dev mailing list archives

From Felix Neutatz <neut...@googlemail.com>
Subject Re: New Flink team member - Kate Eri.
Date Mon, 13 Feb 2017 13:57:23 GMT
Hi Kate,

that's great news. This would help to boost ML on Flink a lot :)

Best regards,
Felix

2017-02-13 14:09 GMT+01:00 Katherin Eri <katherinmail@gmail.com>:

> Hello guys,
>
>
>
> It seems that issue FLINK-1730
> <https://issues.apache.org/jira/browse/FLINK-1730> significantly impacts
> the integration of Flink with SystemML.
>
> They have checked several integrations, and Flink’s integration is the
> slowest
> <https://github.com/apache/incubator-systemml/pull/119#issuecomment-222059794>:
>
>    - MR: LinregDS: 147s (2 jobs); LinregCG w/ 6 iterations: 361s (8 jobs)
>    w/ mmchain; 628s (14 jobs) w/o mmchain
>    - Spark: LinregDS: 71s (3 jobs); LinregCG w/ 6 iterations: 41s (8 jobs)
>    w/ mmchain; 48s (14 jobs) w/o mmchain
>    - Flink: LinregDS: 212s (3 jobs); LinregCG w/ 6 iterations: 1,047s (14
>    jobs) w/o mmchain
>
> As Felix already said, this is caused by two issues:
>
> 1)      FLINK-1730 <https://issues.apache.org/jira/browse/FLINK-1730>
>
> 2)      FLINK-4175 <https://issues.apache.org/jira/browse/FLINK-4175>
>
> Since FLINK-1730 is not assigned to anyone, we would like to take this
> ticket (my colleagues could try to implement it).
>
> I would like to continue the discussion related to FLINK-1730 in the
> appropriate ticket.
>
>
> Fri, Feb 10, 2017 at 19:57, Katherin Eri <katherinmail@gmail.com>:
>
> > I have created a ticket to discuss GPU-related questions further:
> > https://issues.apache.org/jira/browse/FLINK-5782
> >
> > Fri, Feb 10, 2017 at 18:16, Katherin Eri <katherinmail@gmail.com>:
> >
> > Thank you, Trevor!
> >
> > You have shared very valuable points; I will consider them.
> >
> > So I think I should finally create a ticket in Flink’s JIRA, at least
> > for Flink's GPU support, and move the related discussion there?
> >
> > I will contact Suneel regarding DL4J, thanks!
> >
> >
> > Fri, Feb 10, 2017 at 17:44, Trevor Grant <trevor.d.grant@gmail.com>:
> >
> > Also RE: DL4J integration.
> >
> > Suneel had done some work on this a while back, and ran into issues.  You
> > might want to chat with him about the pitfalls and 'gotchas' there.
> >
> >
> >
> > Trevor Grant
> > Data Scientist
> > https://github.com/rawkintrevo
> > http://stackexchange.com/users/3002022/rawkintrevo
> > http://trevorgrant.org
> >
> > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> >
> >
> > On Fri, Feb 10, 2017 at 7:37 AM, Trevor Grant <trevor.d.grant@gmail.com>
> > wrote:
> >
> > > Sorry for chiming in late.
> > >
> > > GPUs on Flink.  Till raised a good point- you need to be able to fall
> > > back to non-GPU resources if they aren't available.
> > >
> > > Fun fact: this has already been developed for Flink vis-a-vis the
> > > Apache Mahout project.
> > >
> > > In short- Mahout exposes a number of tensor functions (vector %*%
> > > matrix, matrix %*% matrix, etc.).  If compiled for GPU support, those
> > > operations are completed via GPU- and if no GPUs are in fact
> > > available, Mahout math falls back to CPUs (and finally back to the
> > > JVM).
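> > >
> > > To give a flavor of that layering, here is a minimal sketch in
> > > Mahout's Samsara Scala DSL (it assumes a DistributedContext for the
> > > chosen backend is already in implicit scope, which is omitted here):
> > >
> > >   import org.apache.mahout.math.drm._
> > >   import org.apache.mahout.math.scalabindings._
> > >   import RLikeDrmOps._
> > >
> > >   // A small distributed matrix; the engine behind it is whichever
> > >   // bindings (Flink or Spark) are on the classpath.
> > >   val drmA = drmParallelize(dense((1, 2), (3, 4)))
> > >
> > >   // Engine-agnostic math: %*% runs on ViennaCL/CUDA if compiled in,
> > >   // else OpenMP, else the plain JVM solver.
> > >   val drmAtA = drmA.t %*% drmA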
> > >
> > > How this should work is Flink takes care of shipping data around the
> > > cluster, and when data arrives at the local node- it is dumped out to
> > > GPU for calculation, loaded back up, and shipped back around the
> > > cluster.  In practice, the lack of a persist method for intermediate
> > > results makes this troublesome (not because of GPUs, but because for
> > > calculating any sort of complex algorithm we expect to be able to
> > > cache intermediate results).
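> > >
> > > To make the caching gap concrete, a rough sketch against Flink's Scala
> > > DataSet API (the persist() in the comment is the missing piece that
> > > FLINK-1730 asks for, not an existing method):
> > >
> > >   import org.apache.flink.api.scala._
> > >
> > >   val env  = ExecutionEnvironment.getExecutionEnvironment
> > >   val data = env.readTextFile("hdfs:///train.csv")
> > >     .map(_.split(',').map(_.toDouble))      // expensive preparation
> > >
> > >   var w = 0.0
> > >   for (_ <- 1 to 6) {
> > >     val wLocal = w
> > >     // Each collect() re-executes readTextFile + map from scratch; a
> > >     // data.persist() after the first pass would avoid the re-read.
> > >     val g = data.map(v => (v(0) * wLocal - v(1)) * v(0))
> > >       .reduce(_ + _).collect().head
> > >     w -= 0.1 * g
> > >   }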
> > >
> > > +1 to FLINK-1730
> > >
> > > Everything in Mahout is modular- distributed engine
> > > (Flink/Spark/Write-your-own), Native Solvers (OpenMP / ViennaCL /
> > > CUDA / Write-your-own), algorithms, etc.
> > >
> > > So to sum up, you're noting the redundancy between ML packages in
> > > terms of algorithms- I would recommend checking out Mahout before
> > > rolling your own GPU integration (else risk redundantly integrating
> > > GPUs). If nothing else- it should give you some valuable insight
> > > regarding design considerations.
> > > Also FYI the goal of the Apache Mahout project is to address that
> > > problem precisely- implement an algorithm once in a mathematically
> > > expressive DSL, which is abstracted above the engine so the same code
> > > easily ports between engines / native solvers (i.e. CPU/GPU).
> > >
> > > https://github.com/apache/mahout/tree/master/viennacl-omp
> > > https://github.com/apache/mahout/tree/master/viennacl
> > >
> > > Best,
> > > tg
> > >
> > >
> > > Trevor Grant
> > > Data Scientist
> > > https://github.com/rawkintrevo
> > > http://stackexchange.com/users/3002022/rawkintrevo
> > > http://trevorgrant.org
> > >
> > > *"Fortunate is he, who is able to know the causes of things."  -Virgil*
> > >
> > >
> > > On Fri, Feb 10, 2017 at 7:01 AM, Katherin Eri <katherinmail@gmail.com>
> > > wrote:
> > >
> > >> Thank you, Felix, for the provided information.
> > >>
> > >> Currently I am analyzing the provided integration of Flink with
> > >> SystemML.
> > >>
> > >> I am also gathering information for the ticket FLINK-1730
> > >> <https://issues.apache.org/jira/browse/FLINK-1730>; maybe we will
> > >> take it on, to unblock the SystemML/Flink integration.
> > >>
> > >>
> > >>
> > >> Thu, Feb 9, 2017 at 0:17, Felix Neutatz
> > >> <neutatz@googlemail.com.invalid>:
> > >>
> > >> > Hi Kate,
> > >> >
> > >> > 1) - Broadcast:
> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-5%3A+Only+send+data+to+each+taskmanager+once+for+broadcasts
> > >> >  - Caching: https://issues.apache.org/jira/browse/FLINK-1730
> > >> >
> > >> > 2) I have no idea about the GPU implementation. The SystemML
> > >> > mailing list will probably help you out there.
> > >> >
> > >> > Best regards,
> > >> > Felix
> > >> >
> > >> > 2017-02-08 14:33 GMT+01:00 Katherin Eri <katherinmail@gmail.com>:
> > >> >
> > >> > > Thank you, Felix, for your point; it is quite interesting.
> > >> > >
> > >> > > I will take a look at the code of the provided Flink integration.
> > >> > >
> > >> > > 1)    You have these problems with Flink: >>we realized that the
> > >> > > lack of a caching operator and a broadcast issue highly affect
> > >> > > the performance. Have you already asked the community about this?
> > >> > > If yes, please provide a reference to the ticket or the subject
> > >> > > of the thread.
> > >> > >
> > >> > > 2)    You have said that SystemML provides GPU support. I have
> > >> > > looked at SystemML’s source code and would like to ask: why did
> > >> > > you decide to implement your own CUDA integration? Did you
> > >> > > consider ND4J, or do you maintain your own implementation because
> > >> > > ND4J is younger?
> > >> > >
> > >> > > Tue, Feb 7, 2017 at 18:35, Felix Neutatz <neutatz@googlemail.com>:
> > >> > >
> > >> > > > Hi Katherin,
> > >> > > >
> > >> > > > we are also working in a similar direction. We implemented a
> > >> > > > prototype to integrate with SystemML:
> > >> > > > https://github.com/apache/incubator-systemml/pull/119
> > >> > > > SystemML provides many different matrix formats, operations, GPU
> > >> > > > support and a couple of DL algorithms. Unfortunately, we
> > >> > > > realized that the lack of a caching operator and a broadcast
> > >> > > > issue highly affect the performance (e.g. compared to Spark). At
> > >> > > > the moment I am trying to tackle the broadcast issue. But
> > >> > > > caching is still a problem for us.
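> > >> > > >
> > >> > > > For context, a minimal sketch of the DataSet broadcast pattern
> > >> > > > whose shipping behavior FLIP-5 targets (today the broadcast set
> > >> > > > is sent to every parallel task, not once per TaskManager):
> > >> > > >
> > >> > > >   import org.apache.flink.api.common.functions.RichMapFunction
> > >> > > >   import org.apache.flink.api.scala._
> > >> > > >   import org.apache.flink.configuration.Configuration
> > >> > > >   import scala.collection.JavaConverters._
> > >> > > >
> > >> > > >   val env    = ExecutionEnvironment.getExecutionEnvironment
> > >> > > >   val model  = env.fromElements(0.5)          // small side input
> > >> > > >   val scored = env.fromElements(1.0, 2.0, 3.0)
> > >> > > >     .map(new RichMapFunction[Double, Double] {
> > >> > > >       private var w: Double = _
> > >> > > >       override def open(conf: Configuration): Unit =
> > >> > > >         // materialized on every parallel task instance
> > >> > > >         w = getRuntimeContext
> > >> > > >           .getBroadcastVariable[Double]("model").asScala.head
> > >> > > >       override def map(x: Double): Double = x * w
> > >> > > >     })
> > >> > > >     .withBroadcastSet(model, "model")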
> > >> > > >
> > >> > > > Best regards,
> > >> > > > Felix
> > >> > > >
> > >> > > > 2017-02-07 16:22 GMT+01:00 Katherin Eri <katherinmail@gmail.com>:
> > >> > > >
> > >> > > > > Thank you, Till.
> > >> > > > >
> > >> > > > > 1)      Regarding ND4J, I didn’t know about such an
> > >> > > > > unfortunate and critical restriction -> the lack of sparsity
> > >> > > > > optimizations, and you are right: this issue is still open for
> > >> > > > > them. I saw that Flink uses Breeze, but I thought its usage
> > >> > > > > was due to historical reasons.
> > >> > > > >
> > >> > > > > 2)      Regarding integration with DL4J, I have read the
> > >> > > > > source code of the DL4J/Spark integration; that’s why I have
> > >> > > > > dropped my idea of reusing their word2vec implementation for
> > >> > > > > now, for example. I can investigate this topic more deeply if
> > >> > > > > required.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > So I feel that we have the following picture:
> > >> > > > >
> > >> > > > > 1)      DL integration investigation could be part of Apache
> > >> > > > > Bahir. I can investigate this topic further, but I think we
> > >> > > > > need a separate ticket to track this activity.
> > >> > > > >
> > >> > > > > 2)      GPU support, required for DL, is interesting, but it
> > >> > > > > requires ND4J, for example.
> > >> > > > >
> > >> > > > > 3)      ND4J couldn’t be incorporated because it doesn’t
> > >> > > > > support sparsity <https://deeplearning4j.org/roadmap.html> [1].
> > >> > > > >
> > >> > > > > Regarding ND4J: is this the single blocker for incorporating
> > >> > > > > it, or are there other known blockers?
> > >> > > > >
> > >> > > > > [1] https://deeplearning4j.org/roadmap.html
> > >> > > > >
> > >> > > > > Tue, Feb 7, 2017 at 16:26, Till Rohrmann <trohrmann@apache.org>:
> > >> > > > >
> > >> > > > > Thanks for initiating this discussion Katherin. I think
> > >> > > > > you're right that in general it does not make sense to
> > >> > > > > reinvent the wheel over and over again. Especially if you only
> > >> > > > > have limited resources at hand. So if we could integrate Flink
> > >> > > > > with some existing library that would be great.
> > >> > > > >
> > >> > > > > In the past, however, we couldn't find a good library which
> > >> > > > > provided enough freedom to integrate it with Flink. Especially
> > >> > > > > if you want to have distributed and somewhat high-performance
> > >> > > > > implementations of ML algorithms you would have to take
> > >> > > > > Flink's execution model (capabilities as well as limitations)
> > >> > > > > into account. That is mainly the reason why we started
> > >> > > > > implementing some of the algorithms "natively" on Flink.
> > >> > > > >
> > >> > > > > If I remember correctly, then the problem with ND4J was and
> > >> > > > > still is that it does not support sparse matrices, which was a
> > >> > > > > requirement from our side. As far as I know, it is quite
> > >> > > > > common that you have sparse data structures when dealing with
> > >> > > > > large scale problems. That's why we built our own abstraction
> > >> > > > > which can have different implementations. Currently, the
> > >> > > > > default implementation uses Breeze.
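> > >> > > > >
> > >> > > > > The abstraction idea in a nutshell (an illustrative sketch,
> > >> > > > > not FlinkML's actual classes): a minimal vector trait with a
> > >> > > > > Breeze-backed default, behind which a sparse or GPU-backed
> > >> > > > > implementation could be swapped in:
> > >> > > > >
> > >> > > > >   import breeze.linalg.{DenseVector => BreezeDense}
> > >> > > > >
> > >> > > > >   trait MlVector {
> > >> > > > >     def apply(i: Int): Double
> > >> > > > >     def size: Int
> > >> > > > >     def dot(other: MlVector): Double
> > >> > > > >   }
> > >> > > > >
> > >> > > > >   // Default implementation delegates to Breeze.
> > >> > > > >   class BreezeVector(val data: BreezeDense[Double]) extends MlVector {
> > >> > > > >     def apply(i: Int): Double = data(i)
> > >> > > > >     def size: Int = data.length
> > >> > > > >     def dot(other: MlVector): Double = other match {
> > >> > > > >       case b: BreezeVector => data dot b.data   // fast path
> > >> > > > >       case v => (0 until size).map(i => data(i) * v(i)).sum
> > >> > > > >     }
> > >> > > > >   }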
> > >> > > > >
> > >> > > > > I think the support for GPU based operations and the actual
> > >> > > > > resource management are two orthogonal things. The
> > >> > > > > implementation would have to work with no GPUs available
> > >> > > > > anyway. If the system detects that GPUs are available, then
> > >> > > > > ideally it would exploit them. Thus, we could add this feature
> > >> > > > > later and maybe integrate it with FLINK-5131 [1].
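> > >> > > > >
> > >> > > > > A minimal sketch of that detect-then-fall-back idea (all names
> > >> > > > > here are illustrative, not existing Flink classes):
> > >> > > > >
> > >> > > > >   trait Solver { def scale(v: Array[Double], a: Double): Array[Double] }
> > >> > > > >
> > >> > > > >   class CpuSolver extends Solver {
> > >> > > > >     def scale(v: Array[Double], a: Double) = v.map(_ * a)  // JVM path
> > >> > > > >   }
> > >> > > > >
> > >> > > > >   class GpuSolver extends Solver {
> > >> > > > >     // would delegate to a JNI/CUDA binding; stubbed here
> > >> > > > >     def scale(v: Array[Double], a: Double) = v.map(_ * a)
> > >> > > > >   }
> > >> > > > >
> > >> > > > >   object Solver {
> > >> > > > >     // Use the GPU path only if its native library loads;
> > >> > > > >     // otherwise fall back to the CPU implementation.
> > >> > > > >     def select(): Solver =
> > >> > > > >       try { System.loadLibrary("cudasolver"); new GpuSolver }
> > >> > > > >       catch { case _: UnsatisfiedLinkError => new CpuSolver }
> > >> > > > >   }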
> > >> > > > >
> > >> > > > > Concerning the integration with DL4J I think that Theo's
> > >> > > > > proposal to do it in a separate repository (maybe as part of
> > >> > > > > Apache Bahir) is a good idea. We're currently thinking about
> > >> > > > > outsourcing some of Flink's libraries into sub projects. This
> > >> > > > > could also be an option for the DL4J integration then. In
> > >> > > > > general I think it should be feasible to run DL4J on Flink
> > >> > > > > given that it also runs on Spark. Have you already looked at
> > >> > > > > it closer?
> > >> > > > >
> > >> > > > > [1] https://issues.apache.org/jira/browse/FLINK-5131
> > >> > > > >
> > >> > > > > Cheers,
> > >> > > > > Till
> > >> > > > >
> > >> > > > > On Tue, Feb 7, 2017 at 11:47 AM, Katherin Eri
> > >> > > > > <katherinmail@gmail.com> wrote:
> > >> > > > >
> > >> > > > > > Thank you, Theodore, for your reply.
> > >> > > > > >
> > >> > > > > > 1)    Regarding GPU, your point is clear and I agree with
> > >> > > > > > it; ND4J looks appropriate. But my current understanding is
> > >> > > > > > that we also need to cover some resource management
> > >> > > > > > questions -> when we provide GPU support, we also need to
> > >> > > > > > manage the GPU as a resource. For example, Mesos already
> > >> > > > > > supports GPUs as a resource type: Initial support for GPU
> > >> > > > > > resources
> > >> > > > > > <https://issues.apache.org/jira/browse/MESOS-4424?jql=text%20~%20GPU>.
> > >> > > > > > Flink uses Mesos as a cluster manager, which means this
> > >> > > > > > Mesos feature could be reused. Memory management questions
> > >> > > > > > in Flink regarding GPUs should also be clarified.
> > >> > > > > >
> > >> > > > > > 2)    Regarding integration with DL4J: what stops us from
> > >> > > > > > creating a ticket and starting the discussion around this
> > >> > > > > > topic? Do we need a user story, or is the community not sure
> > >> > > > > > that DL is really helpful? Why did the discussion with Adam
> > >> > > > > > Gibson end with no implementation of any idea? What concerns
> > >> > > > > > do we have?
> > >> > > > > >
> > >> > > > > > Mon, Feb 6, 2017 at 15:01, Theodore Vasiloudis
> > >> > > > > > <theodoros.vasiloudis@gmail.com>:
> > >> > > > > >
> > >> > > > > > > Hello all,
> > >> > > > > > >
> > >> > > > > > > This is a point that has come up in the past: given the
> > >> > > > > > > multitude of ML libraries out there, should we have native
> > >> > > > > > > implementations in FlinkML or try to integrate other
> > >> > > > > > > libraries instead?
> > >> > > > > > >
> > >> > > > > > > We haven't managed to reach a consensus on this before.
> > >> > > > > > > My opinion is that there is definitely value in having ML
> > >> > > > > > > algorithms written natively in Flink, both for performance
> > >> > > > > > > optimization and, more importantly, for engineering
> > >> > > > > > > simplicity: we don't want to force users to use yet
> > >> > > > > > > another piece of software to run their ML algos (at least
> > >> > > > > > > for a basic set of algorithms).
> > >> > > > > > >
> > >> > > > > > > We have in the past discussed integrations with DL4J
> > >> > > > > > > (particularly ND4J) with Adam Gibson, the core developer
> > >> > > > > > > of the library, but we never got around to implementing
> > >> > > > > > > anything.
> > >> > > > > > >
> > >> > > > > > > Whether it makes sense to have an integration with DL4J
> > >> > > > > > > as part of the Flink distribution would be up for
> > >> > > > > > > discussion. I would suggest making it an independent repo
> > >> > > > > > > to allow for faster dev/release cycles, and because it
> > >> > > > > > > wouldn't be directly related to the core of Flink, it
> > >> > > > > > > would otherwise add extra reviewing burden to an already
> > >> > > > > > > overloaded group of committers.
> > >> > > > > > >
> > >> > > > > > > Natively supporting GPU calculations in Flink would be
> > >> > > > > > > much better achieved through a library like ND4J; the
> > >> > > > > > > engineering burden would be too great otherwise.
> > >> > > > > > >
> > >> > > > > > > Regards,
> > >> > > > > > > Theodore
> > >> > > > > > >
> > >> > > > > > > On Mon, Feb 6, 2017 at 11:26 AM, Katherin Eri
> > >> > > > > > > <katherinmail@gmail.com> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hello, guys.
> > >> > > > > > > >
> > >> > > > > > > > Theodore, last week I started the review of the PR:
> > >> > > > > > > > https://github.com/apache/flink/pull/2735 related to
> > >> > > > > > > > *word2Vec for Flink*.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > During this review I asked myself: why do we need to
> > >> > > > > > > > implement such a very popular algorithm like *word2vec
> > >> > > > > > > > one more time*, when there is already an available Java
> > >> > > > > > > > implementation provided by the deeplearning4j.org
> > >> > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> > >> > > > > > > > Apache 2 licence)? This library actively promotes
> > >> > > > > > > > itself, there is hype around it in the ML sphere, and it
> > >> > > > > > > > has been integrated with Apache Spark to provide
> > >> > > > > > > > scalable deep learning calculations.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > *That's why I thought: could we also integrate Flink
> > >> > > > > > > > with this library?*
> > >> > > > > > > >
> > >> > > > > > > > 1) Personally I think providing support and deployment
> > >> > > > > > > > of *Deep learning (DL) algorithms/models in Flink* is a
> > >> > > > > > > > promising and attractive feature, because:
> > >> > > > > > > >
> > >> > > > > > > >     a) during the last two years DL has proved its
> > >> > > > > > > > efficiency, and these algorithms are used in many
> > >> > > > > > > > applications. For example, *Spotify* uses DL-based
> > >> > > > > > > > algorithms for music content extraction: Recommending
> > >> > > > > > > > music on Spotify with deep learning, AUGUST 05, 2014
> > >> > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>,
> > >> > > > > > > > for their music recommendations. Developers need to
> > >> > > > > > > > scale up DL manually, which causes a lot of work; that's
> > >> > > > > > > > why platforms like Flink should support the deployment
> > >> > > > > > > > of these models.
> > >> > > > > > > >
> > >> > > > > > > >     b) Here is the scope of deep learning use cases
> > >> > > > > > > > <https://deeplearning4j.org/use_cases>; many of these
> > >> > > > > > > > scenarios could be supported on Flink.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > 2) But DL raises questions such as:
> > >> > > > > > > >
> > >> > > > > > > >     a) scaling calculations across machines
> > >> > > > > > > >
> > >> > > > > > > >     b) performing these calculations on both CPU and
> > >> > > > > > > > GPU. GPUs are required to train big DL models; otherwise
> > >> > > > > > > > the learning process could converge very slowly.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > 3) I have checked this DL4J library, which already has
> > >> > > > > > > > rich support for many attractive DL models, such as
> > >> > > > > > > > Recurrent Networks and LSTMs, Convolutional Networks
> > >> > > > > > > > (CNNs), Restricted Boltzmann Machines (RBMs) and others.
> > >> > > > > > > > So we wouldn't need to implement them independently,
> > >> > > > > > > > only provide the ability to execute these models on a
> > >> > > > > > > > Flink cluster, quite similar to the way it was
> > >> > > > > > > > integrated with Apache Spark.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Because of all of this I propose:
> > >> > > > > > > >
> > >> > > > > > > > 1)    To create a new ticket in Flink’s JIRA for the
> > >> > > > > > > > integration of Flink with DL4J and to decide on which
> > >> > > > > > > > side this integration should be implemented.
> > >> > > > > > > >
> > >> > > > > > > > 2)    To natively support GPU resources in Flink and
> > >> > > > > > > > allow calculations over them, as described in this
> > >> > > > > > > > publication:
> > >> > > > > > > > https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > *Regarding the original issue Implement Word2Vec
> > >> > > > > > > > <https://issues.apache.org/jira/browse/FLINK-2094> in
> > >> > > > > > > > Flink,* I have investigated its implementation in DL4J
> > >> > > > > > > > and the implementation of the DL4J integration with
> > >> > > > > > > > Apache Spark, and got several points:
> > >> > > > > > > >
> > >> > > > > > > > It seems that the idea of building our own
> > >> > > > > > > > implementation of word2vec in Flink is not such a bad
> > >> > > > > > > > solution, because DL4J was forced to reimplement its
> > >> > > > > > > > original word2Vec over Spark. I have checked the
> > >> > > > > > > > integration of DL4J with Spark and found that it is too
> > >> > > > > > > > strongly coupled with the Spark API, so it is impossible
> > >> > > > > > > > to just take some DL4J API and reuse it; instead we
> > >> > > > > > > > would need to implement an independent integration for
> > >> > > > > > > > Flink.
> > >> > > > > > > >
> > >> > > > > > > > *That’s why we should simply finish the implementation
> > >> > > > > > > > of the current PR **independently** of the DL4J
> > >> > > > > > > > integration.*
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Could you please share your opinion regarding my
> > >> > > > > > > > questions and points? What do you think about them?
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Mon, Feb 6, 2017 at 12:51, Katherin Eri
> > >> > > > > > > > <katherinmail@gmail.com>:
> > >> > > > > > > >
> > >> > > > > > > > > Sorry, guys, I need to finish this letter first.
> > >> > > > > > > > >   The full version of it will come shortly.
> > >> > > > > > > > >
> > >> > > > > > > > > Mon, Feb 6, 2017 at 12:49, Katherin Eri
> > >> > > > > > > > > <katherinmail@gmail.com>:
> > >> > > > > > > > >
> > >> > > > > > > > > Hello, guys.
> > >> > > > > > > > > Theodore, last week I started the review of the PR:
> > >> > > > > > > > > https://github.com/apache/flink/pull/2735 related to
> > >> > > > > > > > > *word2Vec for Flink*.
> > >> > > > > > > > >
> > >> > > > > > > > > During this review I asked myself: why do we need to
> > >> > > > > > > > > implement such a very popular algorithm like *word2vec
> > >> > > > > > > > > one more time*, when there is already an available
> > >> > > > > > > > > Java implementation provided by the deeplearning4j.org
> > >> > > > > > > > > <https://deeplearning4j.org/word2vec> library (DL4J ->
> > >> > > > > > > > > Apache 2 licence)? This library actively promotes
> > >> > > > > > > > > itself, there is hype around it in the ML sphere, and
> > >> > > > > > > > > it has been integrated with Apache Spark to provide
> > >> > > > > > > > > scalable deep learning calculations.
> > >> > > > > > > > > That's why I thought: could we also integrate Flink
> > >> > > > > > > > > with this library?
> > >> > > > > > > > > 1) Personally I think providing support and
> > >> > > > > > > > > deployment of Deep learning algorithms/models in Flink
> > >> > > > > > > > > is a promising and attractive feature, because:
> > >> > > > > > > > >     a) during the last two years deep learning has
> > >> > > > > > > > > proved its efficiency, and these algorithms are used
> > >> > > > > > > > > in many applications. For example, *Spotify* uses
> > >> > > > > > > > > DL-based algorithms for music content extraction:
> > >> > > > > > > > > Recommending music on Spotify with deep learning,
> > >> > > > > > > > > AUGUST 05, 2014
> > >> > > > > > > > > <http://benanne.github.io/2014/08/05/spotify-cnns.html>,
> > >> > > > > > > > > for their music recommendations. Doing this in a
> > >> > > > > > > > > natively scalable way is very attractive.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > I have investigated the implementation of the DL4J
> > >> > > > > > > > > integration with Apache Spark, and got several points:
> > >> > > > > > > > >
> > >> > > > > > > > > 1) It seems that the idea of building our own
> > >> > > > > > > > > implementation of word2vec is not such a bad solution,
> > >> > > > > > > > > because the integration of DL4J with Spark is too
> > >> > > > > > > > > strongly coupled with the Spark API, and it would take
> > >> > > > > > > > > time on the DL4J side to adapt this integration to
> > >> > > > > > > > > Flink. Also, I had expected that we would be able to
> > >> > > > > > > > > just call some API; it is not such a thing.
> > >> > > > > > > > > 2)
> > >> > > > > > > > >
> > >> > > > > > > > > https://deeplearning4j.org/use_cases
> > >> > > > > > > > > https://www.analyticsvidhya.com/blog/2017/01/t-sne-implementation-r-python/
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > Thu, Jan 19, 2017 at 13:29, Till Rohrmann
> > >> > > > > > > > > <trohrmann@apache.org>:
> > >> > > > > > > > >
> > >> > > > > > > > > Hi Katherin,
> > >> > > > > > > > >
> > >> > > > > > > > > welcome to the Flink community. Always great to see
> > >> > > > > > > > > new people joining the community :-)
> > >> > > > > > > > >
> > >> > > > > > > > > Cheers,
> > >> > > > > > > > > Till
> > >> > > > > > > > >
> > >> > > > > > > > > On Tue, Jan 17, 2017 at 1:02 PM, Katherin Sotenko
> > >> > > > > > > > > <katherinmail@gmail.com> wrote:
> > >> > > > > > > > >
> > >> > > > > > > > > > ok, I've got it.
> > >> > > > > > > > > > I will take a look at
> > >> > > > > > > > > > https://github.com/apache/flink/pull/2735.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Tue, Jan 17, 2017 at 14:36, Theodore Vasiloudis
> > >> > > > > > > > > > <theodoros.vasiloudis@gmail.com>:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > Hello Katherin,
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Welcome to the Flink community!
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > You are correct that the ML component definitely
> > >> > > > > > > > > > > needs a lot of work; we are facing problems
> > >> > > > > > > > > > > similar to CEP's, which we'll hopefully resolve
> > >> > > > > > > > > > > with the restructuring Stephan has mentioned in
> > >> > > > > > > > > > > that thread.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > If you'd like to help out with PRs, we have many
> > >> > > > > > > > > > > open; one I have started reviewing but got
> > >> > > > > > > > > > > side-tracked from is the Word2Vec one [1].
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > Best,
> > >> > > > > > > > > > > Theodore
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > [1] https://github.com/apache/flink/pull/2735
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Tue, Jan 17, 2017 at 12:17 PM, Fabian Hueske
> > >> > > > > > > > > > > <fhueske@gmail.com> wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Hi Katherin,
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > welcome to the Flink community!
> > >> > > > > > > > > > > > Help with reviewing PRs is always very welcome
> > >> > > > > > > > > > > > and a great way to contribute.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > Best, Fabian
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > 2017-01-17 11:17 GMT+01:00 Katherin Sotenko
> > >> > > > > > > > > > > > <katherinmail@gmail.com>:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > Thank you, Timo.
> > >> > > > > > > > > > > > > I have started the analysis of the topic.
> > >> > > > > > > > > > > > > And if necessary, I will try to review other
> > >> > > > > > > > > > > > > pull requests as well. :)
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Tue, Jan 17, 2017 at 13:09, Timo Walther
> > >> > > > > > > > > > > > > <twalthr@apache.org>:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Hi Katherin,
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > great to hear that you would like to
> > contribute!
> > >> > > > Welcome!
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I gave you contributor permissions. You can
> > >> > > > > > > > > > > > > > now assign issues to yourself. I assigned
> > >> > > > > > > > > > > > > > FLINK-1750 to you.
> > >> > > > > > > > > > > > > > Right now there are many open ML pull
> > >> > > > > > > > > > > > > > requests; you are very welcome to review
> > >> > > > > > > > > > > > > > the code of others, too.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > Timo
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On 17/01/17 at 10:39, Katherin Sotenko
> > >> > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > Hello, All!
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > I'm Kate Eri, a Java developer with 6
> > >> > > > > > > > > > > > > > > years of enterprise experience; I also
> > >> > > > > > > > > > > > > > > have some expertise with Scala (half a
> > >> > > > > > > > > > > > > > > year).
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > For the last 2 years I have participated
> > >> > > > > > > > > > > > > > > in several BigData projects related to
> > >> > > > > > > > > > > > > > > Machine Learning (time series analysis,
> > >> > > > > > > > > > > > > > > recommender systems, social networking)
> > >> > > > > > > > > > > > > > > and ETL. I have experience with Hadoop,
> > >> > > > > > > > > > > > > > > Apache Spark and Hive.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > I’m fond of the ML topic, and I see that
> > >> > > > > > > > > > > > > > > the Flink project requires some work in
> > >> > > > > > > > > > > > > > > this area; that’s why I would like to
> > >> > > > > > > > > > > > > > > join Flink and ask to be granted the
> > >> > > > > > > > > > > > > > > assignment of the ticket
> > >> > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/FLINK-1750
> > >> > > > > > > > > > > > > > > to me.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
> >
>
