airavata-dev mailing list archives

From Apoorv Palkar <>
Subject Re: Apache Flink Execution
Date Mon, 12 Jun 2017 20:19:32 GMT
Same deal I've found with Spark. Much generic data processing is handled by Spark, such as
map, reduce, and filter. If we want to make it work, we need to add our own implementation, which
could potentially be a problem.
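To make that point concrete, here is a minimal Python sketch (plain stdlib, not the Spark API; the records and function names are hypothetical) of what "generic transformations plus a user-supplied function" looks like – the only extension point is the function passed into map/filter/reduce:

```python
from functools import reduce

# Hypothetical records: (job_id, wall_seconds) pairs.
records = [("jobA", 120), ("jobB", 45), ("jobA", 60), ("jobC", 300)]

# The only customization the engine allows is the function passed INTO
# a generic transformation, e.g. a custom mapper:
def to_minutes(pair):
    job, seconds = pair
    return (job, seconds / 60)

mapped = list(map(to_minutes, records))                      # MAP
long_jobs = list(filter(lambda p: p[1] >= 1, mapped))        # FILTER
total = reduce(lambda acc, p: acc + p[1], long_jobs, 0.0)    # REDUCE
```

Anything that does not fit this map/filter/reduce shape has to be forced into it, which is exactly the problem described above.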

-----Original Message-----
From: Shenoy, Gourav Ganesh <>
To: dev <>
Sent: Mon, Jun 12, 2017 4:16 pm
Subject: Re: Apache Flink Execution

Hi Dev,
After doing some more readings and playing around with Storm & Flink code examples, I
am now of the opinion that – although Flink provides us with certain benefits over Storm
(see prev. email) – integrating Flink to suit the Airavata use case might not work. The
reasons are as follows:
1.      Implementing custom functions/task-executors in Flink is not as straightforward as
in Storm (bolts) – Flink uses the concept of datasets and transformations. The notion is
that we define the data (bounded/unbounded) and apply transformations on this data – that
is, we define operators that transform the input data into output data. The problem here is that
the transformations Flink accepts are limited to generic data processing, such as MAP,
REDUCE, JOIN, GROUP-BY, KEY-BY, AGGREGATE, etc. The only flexibility is that we can define our
own implementations of these generic transformation APIs.

In contrast, Airavata needs much more complicated implementations of task executors.
These generic transformations are of no use in Airavata as they only target stream-processing
use cases. E.g., if you have a dataset of calls made between two people and the duration of
each call, we can override the MAP and GROUP functions to produce a transformed dataset of <call,
totalduration>. The word-count example is similar.
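The call-duration example above can be sketched in a few lines of plain Python (hypothetical data; stdlib only, standing in for Flink's MAP and GROUP-BY/AGGREGATE transformations):

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, duration_seconds).
calls = [
    ("alice", "bob", 300),
    ("alice", "carol", 120),
    ("bob", "carol", 60),
]

# MAP: project each record to a (key, value) pair.
pairs = [(caller, duration) for caller, _, duration in calls]

# GROUP-BY + AGGREGATE: sum durations per caller -> <caller, totalDuration>.
totals = defaultdict(int)
for caller, duration in pairs:
    totals[caller] += duration
```

This kind of key/value reshaping is exactly what the generic transformations are good at – and it is a long way from an arbitrary task executor.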
2.      Although Flink claims to support bounded datasets (as opposed to Storm, which expects
unbounded data – it can be tweaked to handle bounded data, but that support is not available
natively), the datasets need to be a Collection/Tuple (in most cases).
3.      The thing that troubles me the most is that there is NO way to define custom
executors and invoke them in the manner we anticipate. E.g., we would ideally want to deploy/enable
task executors – Job-Submission, Data-Staging, Monitoring, etc. – on workers, and then
create a DAG to invoke them. This capability is available in Storm via a Topology (DAG), Spouts
(dataset) and Bolts (executors). But in Flink, it's more a question of how we can apply some kind of
transformation to the incoming dataset and generate a new dataset – whether aggregating
records, breaking sentences into words and grouping identical words to count them, etc.
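For what it's worth, the "DAG of task executors" we want can be sketched in plain Python (all executor names and the context dict are hypothetical – this is the shape a Storm topology gives us, not any real Airavata or Storm API):

```python
# Each "executor" is a plain function; the DAG is just an ordered chain
# of executors passing a shared context along (linear DAG for brevity).
def job_submission(ctx):
    ctx["job_id"] = "job-42"          # hypothetical job handle
    return ctx

def data_staging(ctx):
    ctx["staged"] = True              # pretend files were moved
    return ctx

def monitoring(ctx):
    ctx["status"] = "COMPLETED" if ctx.get("staged") else "FAILED"
    return ctx

dag = [job_submission, data_staging, monitoring]

def run(dag, ctx):
    for executor in dag:
        ctx = executor(ctx)
    return ctx

result = run(dag, {})
```

The point of contention is that Flink's programming model has no natural slot for executors like these – everything must be phrased as a transformation over a dataset.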
The only positive I observed was the ability to run a Storm topology in Flink – but this
is more of a backward-compatibility feature for migrating user applications written in Storm
over to Flink. I am not an expert in Flink, so what I've pointed out above is an understanding
gained from reading the literature and running the code examples. Anyone who has worked with Flink,
please feel free to provide your input.
Thanks and Regards,
Gourav Shenoy

From: "Shenoy, Gourav Ganesh" <>
Reply-To: "" <>
Date: Wednesday, June 7, 2017 at 11:12 AM
To: "" <>
Subject: Re: Apache Flink Execution


Hi dev,
I did some literature reading on Storm vs. Flink, with an emphasis on our use case of Distributed
Task Execution. My initial impressions are as follows (I will also be updating the Google
docs accordingly):
1.     Although the Storm and Flink engines appear similar in supporting pipeline
processing, Storm can only handle data streams, whereas Flink supports both stream and batch processing.
This allows Flink to perform data transfer between parallel tasks – we do not have such
support as of today, but we can definitely think of parallel task execution.
2.     Storm supports at-least-once and at-most-once data processing, whereas Flink guarantees
exactly-once processing. Storm also supports exactly-once via its Trident API. From what
I read, Flink claims to be more efficient in terms of processing semantics, as it uses
a lighter algorithm for check-pointing data transfers.
3.     Flink offers high-level APIs that simplify the data-collection process, which is
a little tedious in Storm: there one needs to manually implement readers and collectors,
whereas Flink provides functions such as Map, GroupBy, Window and Join.
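As a rough illustration of what GroupBy plus a Window buy you, here is a hand-rolled Python sketch (hypothetical event stream; stdlib only, not the Flink API) of a tumbling-window word count – the kind of bookkeeping one would otherwise write by hand in a Storm bolt:

```python
from collections import defaultdict

# Hypothetical event stream: (timestamp_seconds, word).
events = [(1, "flink"), (2, "storm"), (6, "flink"), (7, "flink")]

WINDOW = 5  # tumbling window of 5 seconds

# Window assignment + GroupBy(word) + count, done by hand:
counts = defaultdict(int)
for ts, word in events:
    window_start = (ts // WINDOW) * WINDOW   # which window this event falls in
    counts[(window_start, word)] += 1
```

In Flink this whole loop collapses into a keyBy/window/aggregate chain; in Storm the windowing and grouping logic lives in user code much like the above.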
4.     A major positive in Flink is the ability to maintain custom state information in operators/executors.
This custom state can also be used in check-pointing for fault tolerance.
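The stateful-operator idea can be sketched as follows (a toy Python class, not Flink's state API; snapshot/restore stand in for what the engine would persist at a checkpoint and reload after a failure):

```python
class CountingOperator:
    """Toy stateful operator: counts records per key."""
    def __init__(self):
        self.state = {}

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def snapshot(self):
        return dict(self.state)      # what a checkpoint would persist

    def restore(self, snap):
        self.state = dict(snap)      # recovery after a failure

op = CountingOperator()
for k in ["a", "b", "a"]:
    op.process(k)
checkpoint = op.snapshot()

# Simulate a crash: a fresh operator restores the checkpoint and resumes.
op2 = CountingOperator()
op2.restore(checkpoint)
op2.process("a")
```

In Flink the engine takes these snapshots for you as part of its checkpointing protocol, which is what makes the exactly-once guarantee cheap to provide.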
I think Flink is an improvement over Storm, but this is just my understanding from initial
readings; I haven't yet tried coding any examples in Flink. Also, most of the features/differences
mentioned above, offered by both Storm and Flink, target stream processing with a focus on
executing a large number of small tasks (in parallel?) over continuous streaming data, so the
fight is over offering low-latency processing; these might not necessarily be
that important for the Airavata use case (tasks may take time to complete).
Thanks and Regards,
Gourav Shenoy

From: "Pierce, Marlon" <>
Reply-To: <>
Date: Wednesday, May 24, 2017 at 11:36 AM
To: "" <>
Subject: Re: Apache Flink Execution


Thanks, Apoorv.  Note for everyone else: request access if you’d like to leave a comment
or make a suggestion.

From: Apoorv Palkar <>
Reply-To: "" <>
Date: Wednesday, May 24, 2017 at 11:32 AM
To: "" <>
Subject: Apache Flink Execution


LINK for Flink Use/fundamental
