flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: Flink and Spark
Date Sat, 27 Dec 2014 12:18:21 GMT
Hi Samarth,

You can also find different implementations of ALS on Flink here:
https://github.com/project-flink/flink-perf/tree/master/flink-jobs/src/main/scala/com/github/projectflink/als

On Fri, Dec 26, 2014 at 4:12 PM, Robert Metzger <rmetzger@apache.org> wrote:

> For the Python API, there is a pending pull request:
> https://github.com/apache/incubator-flink/pull/202 It is still work in
> progress, but feedback is, as always, appreciated.
>
>
>
> On Fri, Dec 26, 2014 at 3:41 PM, Samarth Mailinglist <
> mailinglistsamarth@gmail.com> wrote:
>
>> Thanks a lot Márton and Gyula!
>>
>> On Fri, Dec 26, 2014 at 2:42 PM, Márton Balassi <balassi.marton@gmail.com
>> > wrote:
>>
>>> Hey,
>>>
>>> You can find some ML examples such as LinearRegression [1, 2] and KMeans
>>> [3, 4] in the examples package, in both Java and Scala, as a quick start
>>> (a minimal job skeleton is sketched after the links below).
>>>
>>> [1]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/ml/LinearRegression.java
>>> [2]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/ml/LinearRegression.scala
>>> [3]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/clustering/KMeans.java
>>> [4]
>>> https://github.com/apache/incubator-flink/blob/release-0.8/flink-examples/flink-scala-examples/src/main/scala/org/apache/flink/examples/scala/clustering/KMeans.scala
>>>
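For readers of the archive, here is a minimal sketch of how such an example job is typically structured, assuming the 0.8-era Java DataSet API referenced above; the class name, input values, and parsing logic are invented for illustration and are not taken from the linked examples.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class MinimalJobSketch {

        public static void main(String[] args) throws Exception {
            // Obtain the execution environment (local or cluster, depending on
            // how the program is launched).
            final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Tiny in-memory input standing in for real data, e.g. points read from a file.
            DataSet<String> lines = env.fromElements("1.0 2.0", "3.0 4.0", "5.0 6.0");

            // A simple transformation: parse the first field of every line.
            DataSet<Double> firstFields = lines.map(new MapFunction<String, Double>() {
                @Override
                public Double map(String line) {
                    return Double.parseDouble(line.split(" ")[0]);
                }
            });

            // In the 0.8 API, print() registers a sink and env.execute() runs the job.
            firstFields.print();
            env.execute("Minimal job sketch");
        }
    }

The LinearRegression and KMeans examples linked above follow the same overall pattern, with iterations and custom data types in the middle.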
>>> On Fri, Dec 26, 2014 at 7:31 AM, Samarth Mailinglist <
>>> mailinglistsamarth@gmail.com> wrote:
>>>
>>>> Thank you for the answers, folks.
>>>> Can anyone provide a link to an implementation of an ML algorithm
>>>> on Flink?
>>>>
>>>> On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <gyfora@apache.org> wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> 1-2. As for failure recovery, there is a difference in how Flink
>>>>> batch and streaming programs handle failures. The failed parts of batch
>>>>> jobs are currently restarted upon failure, but there is an ongoing
>>>>> effort on fine-grained fault tolerance that is somewhat similar to
>>>>> Spark's lineage tracking. (Technically this is exactly-once semantics,
>>>>> but that is somewhat meaningless for batch jobs.)
>>>>>
>>>>> For streaming programs we are currently working on fault tolerance; we
>>>>> are hoping to support at-least-once processing guarantees in the 0.9
>>>>> release. After that we will focus our research efforts on a
>>>>> high-performance implementation of exactly-once processing semantics,
>>>>> which is still a hard topic in streaming systems. Storm's Trident can
>>>>> only provide its exactly-once semantics at very low throughput, and we
>>>>> are trying hard to avoid that issue, as our streaming system is capable
>>>>> of much higher throughput than Storm in general, as you can see in some
>>>>> performance measurements.
>>>>>
>>>>> 3. There are already many ML algorithms implemented for Flink, but they
>>>>> are scattered all around. We are planning to collect them in a machine
>>>>> learning library soon. We are also implementing an adapter for Samoa,
>>>>> which will provide some streaming machine learning algorithms as well.
>>>>> The Samoa integration should be ready in January.
>>>>>
>>>>> 4. Flink carefully manages its memory use to avoid heap errors and to
>>>>> utilize memory as effectively as it can. The optimizer for batch
>>>>> programs also takes care of a lot of optimization steps that the user
>>>>> would have to do manually in other systems, like optimizing the order of
>>>>> transformations. There are of course parts of the program that still
>>>>> need to be tuned by hand for maximal performance, for example the
>>>>> parallelism settings of some operators in some cases (a brief sketch of
>>>>> operator-level parallelism follows this message).
>>>>>
>>>>> 5. As for the status of the Python API, I personally cannot say very
>>>>> much; maybe someone can jump in and help me with that question :)
>>>>>
>>>>> Regards,
>>>>> Gyula
>>>>>
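To illustrate the parallelism remark in point 4, here is a minimal sketch against the Java DataSet API; the data, the grouping key, and the chosen parallelism values are invented for illustration, and the method names follow the 0.8-era API as I recall it (where the environment-wide setting was still called setDegreeOfParallelism; later versions renamed it).

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.aggregation.Aggregations;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class ParallelismSketch {

        public static void main(String[] args) throws Exception {
            final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Default parallelism for all operators of this job (0.8-era method name).
            env.setDegreeOfParallelism(4);

            DataSet<Tuple2<String, Integer>> pairs = env.fromElements(
                    new Tuple2<String, Integer>("a", 1),
                    new Tuple2<String, Integer>("b", 2),
                    new Tuple2<String, Integer>("a", 3));

            // Override the parallelism of one operator only, e.g. for a small
            // aggregation that does not benefit from the full default parallelism.
            DataSet<Tuple2<String, Integer>> sums = pairs
                    .groupBy(0)
                    .aggregate(Aggregations.SUM, 1)
                    .setParallelism(2);

            sums.print();
            env.execute("Parallelism sketch");
        }
    }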
>>>>> On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <
>>>>> mailinglistsamarth@gmail.com> wrote:
>>>>>
>>>>>> Thank you for your answer. I have a couple of follow-up questions:
>>>>>> 1. Does it support the 'exactly-once semantics' that Spark and Storm
>>>>>> support?
>>>>>> 2. (Related to 1) What happens when an error occurs during
>>>>>> processing?
>>>>>> 3. Is there a plan for adding machine learning support on top of
>>>>>> Flink? Say Alternating Least Squares or basic Naive Bayes?
>>>>>> 4. When you say Flink manages itself, does that mean I don't have to
>>>>>> fiddle with the number of partitions (Spark) or the number of mappers /
>>>>>> reducers (Hadoop) to optimize performance? (In some cases this might be
>>>>>> needed.)
>>>>>> 5. How far along is the Python API? I don't see the specs on the
>>>>>> website.
>>>>>>
>>>>>> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <mbalassi@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear Samarth,
>>>>>>>
>>>>>>> Besides the discussions you have mentioned [1], I can recommend one
>>>>>>> of our recent presentations [2], especially the section on what
>>>>>>> distinguishes Flink (from slide 16).
>>>>>>>
>>>>>>> It is generally a difficult question, as both systems are rapidly
>>>>>>> evolving, so the answer can become outdated quite fast. However, there
>>>>>>> are fundamental design features that are highly unlikely to change.
>>>>>>> For example, Spark uses "true" batch processing, meaning that
>>>>>>> intermediate results are materialized (mostly in memory) as RDDs.
>>>>>>> Flink's engine is internally more like a streaming engine, forwarding
>>>>>>> results to the next operator as soon as possible; the latter can yield
>>>>>>> performance benefits for more complex jobs (a small sketch of such a
>>>>>>> pipelined job follows this message). Flink also gives you a query
>>>>>>> optimizer, spills gracefully to disk when the system runs out of
>>>>>>> memory, and has some cool features around serialization. For
>>>>>>> performance numbers and some more insight please check out the
>>>>>>> presentation [2], and do not hesitate to post a follow-up mail here if
>>>>>>> you come across something unclear or extraordinary.
>>>>>>>
>>>>>>> [1]
>>>>>>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>>>>>>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Marton
>>>>>>>
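As a small illustration of the pipelined execution described above, here is a sketch of a chained batch job in the Java DataSet API; the data and transformations are invented for illustration. The point is that records flow through the chain of operators as they are produced, instead of each intermediate DataSet being fully materialized first.

    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class PipelinedChainSketch {

        public static void main(String[] args) throws Exception {
            final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8);

            // The intermediate result of the map is streamed straight into the
            // filter; it is not collected as a whole in memory or on disk first.
            DataSet<Integer> squared = numbers.map(new MapFunction<Integer, Integer>() {
                @Override
                public Integer map(Integer value) {
                    return value * value;
                }
            });

            DataSet<Integer> large = squared.filter(new FilterFunction<Integer>() {
                @Override
                public boolean filter(Integer value) {
                    return value > 10;
                }
            });

            large.print();
            env.execute("Pipelined chain sketch");
        }
    }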
>>>>>>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
>>>>>>> mailinglistsamarth@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey folks, I have a noob question.
>>>>>>>>
>>>>>>>> I already looked up the archives and saw a couple of discussions
>>>>>>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>>>>>>> about Spark and Flink.
>>>>>>>>
>>>>>>>> I am familiar with Spark (the Python API, especially MLlib), and I
>>>>>>>> see many similarities between Flink and Spark.
>>>>>>>>
>>>>>>>> How does Flink distinguish itself from Spark?
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
