flink-user mailing list archives

From Samarth Mailinglist <mailinglistsama...@gmail.com>
Subject Re: Flink and Spark
Date Fri, 26 Dec 2014 06:31:09 GMT
Thank you for the answers, folks.
Can anyone provide me a link to an implementation of an ML algorithm on
Flink?

On Thu, Dec 25, 2014 at 8:07 PM, Gyula Fóra <gyfora@apache.org> wrote:

> Hey,
>
> 1-2. As for failure recovery, there is a difference in how Flink batch and
> streaming programs handle failures. The failed parts of batch jobs currently
> restart upon failure, but there is an ongoing effort on fine-grained fault
> tolerance which is somewhat similar to Spark's lineage tracking. (So
> technically this is exactly-once semantics, but that is somewhat meaningless
> for batch jobs.)
>
> For streaming programs we are currently working on fault tolerance; we are
> hoping to support at-least-once processing guarantees in the 0.9 release.
> After that we will focus our research efforts on a high-performance
> implementation of exactly-once processing semantics, which is still a hard
> topic in streaming systems. Storm Trident's exactly-once semantics can only
> provide very low throughput, while we are trying hard to avoid this issue,
> as our streaming system is in general capable of much higher throughput than
> Storm, as you can see in some performance measurements.
>
> 3. There are already many ML algorithms implemented for Flink, but they are
> scattered all around. We are planning to collect them in a machine learning
> library soon. We are also implementing an adapter for Samoa, which will
> provide some streaming machine learning algorithms as well. Samoa
> integration should be ready in January.
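>
> To give a flavour of what such an algorithm looks like on Flink, here is a
> minimal, made-up sketch (not taken from any existing library) of a
> one-variable linear regression trained with batch gradient descent on the
> Java DataSet API, using bulk iterations and broadcast sets. The data, class
> and variable names are purely illustrative, and exact method names may
> differ slightly between Flink versions:
>
> import org.apache.flink.api.common.functions.MapFunction;
> import org.apache.flink.api.common.functions.ReduceFunction;
> import org.apache.flink.api.common.functions.RichMapFunction;
> import org.apache.flink.api.java.DataSet;
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.operators.IterativeDataSet;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.configuration.Configuration;
>
> public class SimpleLinearRegression {
>
>     public static void main(String[] args) throws Exception {
>         ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
>         // Toy training data: (x, y) pairs roughly following y = 2 * x.
>         DataSet<Tuple2<Double, Double>> points = env.fromElements(
>                 new Tuple2<>(1.0, 2.1), new Tuple2<>(2.0, 3.9), new Tuple2<>(3.0, 6.2));
>
>         // Start from weight w = 0 and refine it over 50 bulk iterations.
>         IterativeDataSet<Double> weight = env.fromElements(0.0).iterate(50);
>
>         DataSet<Double> newWeight = points
>                 // Each point proposes an updated weight via one gradient step,
>                 // reading the current weight from a broadcast set.
>                 .map(new RichMapFunction<Tuple2<Double, Double>, Tuple2<Double, Integer>>() {
>                     private double w;
>
>                     @Override
>                     public void open(Configuration parameters) {
>                         w = getRuntimeContext().<Double>getBroadcastVariable("weight").get(0);
>                     }
>
>                     @Override
>                     public Tuple2<Double, Integer> map(Tuple2<Double, Double> p) {
>                         double gradient = (w * p.f0 - p.f1) * p.f0;
>                         return new Tuple2<>(w - 0.01 * gradient, 1);
>                     }
>                 }).withBroadcastSet(weight, "weight")
>                 // Average the per-point proposals into the next weight.
>                 .reduce(new ReduceFunction<Tuple2<Double, Integer>>() {
>                     @Override
>                     public Tuple2<Double, Integer> reduce(Tuple2<Double, Integer> a,
>                                                           Tuple2<Double, Integer> b) {
>                         return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
>                     }
>                 })
>                 .map(new MapFunction<Tuple2<Double, Integer>, Double>() {
>                     @Override
>                     public Double map(Tuple2<Double, Integer> sumAndCount) {
>                         return sumAndCount.f0 / sumAndCount.f1;
>                     }
>                 });
>
>         // Close the iteration with the updated weight and print the result.
>         weight.closeWith(newWeight).print();
>     }
> }
>
> The KMeans example that ships with the Flink sources follows the same
> bulk-iteration-plus-broadcast pattern on real input files.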
>
> 4. Flink carefully manages its memory use to avoid heap errors and to
> utilize memory as effectively as it can. The optimizer for batch programs
> also takes care of a lot of optimization steps that the user would have to
> do manually in other systems, such as optimizing the order of
> transformations. There are of course parts of the program that still need
> to be modified for maximal performance, for example the parallelism
> settings of some operators in some cases.
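>
> As a small, made-up illustration of what that manual tuning looks like: the
> parallelism of an individual operator can be overridden in the DataSet API
> while the rest of the job keeps the default (paths, class names and numbers
> below are placeholders):
>
> import org.apache.flink.api.common.functions.FlatMapFunction;
> import org.apache.flink.api.java.DataSet;
> import org.apache.flink.api.java.ExecutionEnvironment;
> import org.apache.flink.api.java.tuple.Tuple2;
> import org.apache.flink.util.Collector;
>
> public class ParallelismExample {
>
>     public static void main(String[] args) throws Exception {
>         ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
>         // Job-wide default parallelism.
>         env.setParallelism(16);
>
>         DataSet<Tuple2<String, Integer>> counts = env
>                 .readTextFile("hdfs:///path/to/input")    // placeholder path
>                 .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
>                     @Override
>                     public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
>                         for (String word : line.toLowerCase().split("\\W+")) {
>                             if (!word.isEmpty()) {
>                                 out.collect(new Tuple2<>(word, 1));
>                             }
>                         }
>                     }
>                 })                                        // runs with the default (16)
>                 .groupBy(0)
>                 .sum(1)
>                 .setParallelism(4);                       // aggregation forced to 4 here
>
>         counts.writeAsCsv("hdfs:///path/to/output");      // placeholder path
>         env.execute("Per-operator parallelism example");
>     }
> }
>
> Whether such a hint actually helps depends on the data and the cluster,
> which is why it is left to the user.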
>
> 5. As for the status of the Python API, I personally cannot say very much;
> maybe someone can jump in and help me with that question :)
>
> Regards,
> Gyula
>
> On Thu, Dec 25, 2014 at 11:58 AM, Samarth Mailinglist <
> mailinglistsamarth@gmail.com> wrote:
>
>> Thank you for your answer. I have a couple of follow-up questions:
>> 1. Does it support the 'exactly-once semantics' that Spark and Storm
>> support?
>> 2. (Related to 1) What happens when an error occurs during processing?
>> 3. Is there a plan for adding machine learning support on top of Flink?
>> Say Alternating Least Squares or basic Naive Bayes?
>> 4. When you say Flink manages itself, does it mean I don't have to fiddle
>> with the number of partitions (Spark) or the number of mappers / reducers
>> (Hadoop) to optimize performance? (In some cases this might be needed.)
>> 5. How far along is the Python API? I don't see the specs on the website.
>>
>> On Thu, Dec 25, 2014 at 4:31 AM, Márton Balassi <mbalassi@apache.org>
>> wrote:
>>
>>> Dear Samarth,
>>>
>>> Besides the discussions you have mentioned [1], I can recommend one of
>>> our recent presentations [2], especially the section distinguishing Flink
>>> from Spark (from slide 16).
>>>
>>> It is generally a difficult question, as both systems are rapidly
>>> evolving, so the answer can become outdated quite fast. However, there are
>>> fundamental design features that are highly unlikely to change. For
>>> example, Spark uses "true" batch processing, meaning that intermediate
>>> results are materialized (mostly in memory) as RDDs, while Flink's engine
>>> internally works more like a streaming engine, forwarding results to the
>>> next operator as soon as possible. The latter can yield performance
>>> benefits for more complex jobs. Flink also gives you a query optimizer,
>>> spills gracefully to disk when the system runs out of memory, and has some
>>> cool features around serialization. For performance numbers and some more
>>> insight, please check out the presentation [2], and do not hesitate to
>>> post a follow-up mail here if you come across something unclear or
>>> extraordinary.
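>>>
>>> To make the pipelining point a bit more concrete, here is a small, made-up
>>> job sketch (schemas, paths and the threshold are purely illustrative).
>>> Records leaving the filter are forwarded into the join and the aggregation
>>> as they are produced instead of the filtered data set being fully
>>> materialized first, and the shipping and join strategies are chosen by the
>>> optimizer:
>>>
>>> import org.apache.flink.api.common.functions.FilterFunction;
>>> import org.apache.flink.api.common.functions.JoinFunction;
>>> import org.apache.flink.api.java.DataSet;
>>> import org.apache.flink.api.java.ExecutionEnvironment;
>>> import org.apache.flink.api.java.tuple.Tuple2;
>>> import org.apache.flink.api.java.tuple.Tuple3;
>>>
>>> public class PipelinedJobSketch {
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>>>
>>>         // (userId, amount) and (userId, country) -- made-up input schemas.
>>>         DataSet<Tuple2<Integer, Double>> orders = env
>>>                 .readCsvFile("hdfs:///path/to/orders")
>>>                 .types(Integer.class, Double.class);
>>>         DataSet<Tuple2<Integer, String>> users = env
>>>                 .readCsvFile("hdfs:///path/to/users")
>>>                 .types(Integer.class, String.class);
>>>
>>>         DataSet<Tuple3<Integer, Double, String>> bigOrders = orders
>>>                 // Keep only large orders; matching records flow straight on.
>>>                 .filter(new FilterFunction<Tuple2<Integer, Double>>() {
>>>                     @Override
>>>                     public boolean filter(Tuple2<Integer, Double> order) {
>>>                         return order.f1 > 100.0;
>>>                     }
>>>                 })
>>>                 // Enrich each large order with the user's country.
>>>                 .join(users).where(0).equalTo(0)
>>>                 .with(new JoinFunction<Tuple2<Integer, Double>, Tuple2<Integer, String>,
>>>                         Tuple3<Integer, Double, String>>() {
>>>                     @Override
>>>                     public Tuple3<Integer, Double, String> join(Tuple2<Integer, Double> order,
>>>                                                                 Tuple2<Integer, String> user) {
>>>                         return new Tuple3<>(order.f0, order.f1, user.f1);
>>>                     }
>>>                 });
>>>
>>>         // Total amount of large orders per country.
>>>         bigOrders.groupBy(2).sum(1).print();
>>>     }
>>> }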
>>>
>>> [1]
>>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark
>>> [2] http://www.slideshare.net/GyulaFra/flink-apachecon
>>>
>>> Best,
>>>
>>> Marton
>>>
>>> On Tue, Dec 23, 2014 at 6:19 PM, Samarth Mailinglist <
>>> mailinglistsamarth@gmail.com> wrote:
>>>
>>>> Hey folks, I have a noob question.
>>>>
>>>> I already looked through the archives and saw a couple of discussions
>>>> <http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/template/NamlServlet.jtp?macro=search_page&node=1&query=spark>
>>>> about Spark and Flink.
>>>>
>>>> I am familiar with Spark (the Python API, especially MLlib), and I see
>>>> many similarities between Flink and Spark.
>>>>
>>>> How does Flink distinguish itself from Spark?
>>>>
>>>
>>>
>>
>
