flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gyula Fóra <gyf...@apache.org>
Subject Re: Flink and Spark
Date Thu, 25 Dec 2014 14:37:21 GMT

1-2. As for failure recovery, there is a difference how the Flink batch and
streaming programs handle failures. The failed parts of the batch jobs
currently restart upon failures but there is an ongoing effort on fine
grained fault tolerance which is somewhat similar to sparks lineage
tracking. (so technically this is exactly once semantic but that is
somewhat meaningless for batch jobs)

For streaming programs we are currently working on fault tolerance, we are
hoping to support at least once processing guarantees in the 0.9 release.
After that we will focus our research efforts on an high performance
implementation of exactly once processing semantics, which is still a hard
topic in streaming systems. Storm's trident's exaclty once semantics can
only provide very low throughput while we are trying hard to avoid this
issue, as our streaming system is capable of much higher throughput than
storm in general as you can see on some perf measurements.

3. There are already many ml algorithms implemented for Flink but they are
scattered all around. We are planning to collect them in a machine learning
library soon. We are also implementing an adapter for Samoa which will
provide some streaming machine learning algorithms as well. Samoa
integration should be ready in January.

4. Flink carefully manages its memory use to avoid heap errors, and
utilizing memory space as effectively as it can. The optimizer for batch
programs also takes care of a lot of optimization steps that the user would
manually have to do in other system, like optimizing the order of
transformations etc. There are of course parts of the program that still
needs to modified for maximal performance, for example parallelism settings
for some operators in some cases.

5. As for the status of the Python API I personally cannot say very much,
maybe someone can jump in and help me with that question :)


On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <
mailinglistsamarth@gmail.com> wrote:

> Thank you for your answer. I have a couple of follow up questions:
> 1. Does it support 'exactly once semantics' that Spark and Storm support?
> 2. (Related to 1) What happens when an error occurs during processing?
> 3. Is there a plan for adding Machine Learning support on top of Flink?
> Say Alternative Least Squares, Basic Naive Bayes?
> 4. When you say Flink manages itself, does it mean I don't have to fiddle
> with number of partitions (Spark), number of reduces / happers (Hadoop?) to
> optimize performance? (In some cases this might be needed)
> 5. How far along is the Python API? I don't see the specs in the Website.
> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <mbalassi@apache.org>
> wrote:
>> Dear Samarth,
>> Besides the discussions you have mentioned [1] I can recommend one of our
>> recent presentations [2], especially the distinguishing Flink section (from
>> slide 16).
>> It is generally a difficult question as both the systems are rapidly
>> evolving, so the answer can become outdated quite fast. However there are
>> fundamental design features that are highly unlikely to change, for example
>> Spark uses "true" batch processing, meaning that intermediate results are
>> materialized (mostly in memory) as RDDs. Flink's engine is internally more
>> like streaming, forwarding the results to the next operator asap. The
>> latter can yield performance benefits for more complex jobs. Flink also
>> gives you a query optimizer, spills gracefully to disk when the system runs
>> out of memory and has some cool features around serialization. For
>> performance numbers and some more insight please check out the presentation
>> [2] and do not hesitate to post a follow-up mail here if you come across
>> something unclear or extraordinary.
>> [1]
>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>> Best,
>> Marton
>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
>> mailinglistsamarth@gmail.com> wrote:
>>> Hey folks, I have a noob question.
>>> I already looked up the archives and saw a couple of discussions
>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>> about Spark and Flink.
>>> I am familiar with spark (the python API, esp MLLib), and I see many
>>> similarities between Flink and Spark.
>>> How does Flink distinguish itself from Spark?

View raw message