flink-user mailing list archives

From Chiwan Park <chiwanp...@apache.org>
Subject Re: What is the equivalent of Spark RDD in Flink
Date Thu, 31 Dec 2015 04:08:29 GMT
About question 1,

Scheduling once for an iterative job is one of the factors behind the performance difference. Dongwon's slides [1] would be helpful for understanding the other factors.

[1] http://flink-forward.org/?session=a-comparative-performance-evaluation-of-flink
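
To make the "scheduled once" point concrete, here is a minimal sketch using the Flink Scala DataSet API (the step function and iteration count are made up for illustration, not from the thread):

    import org.apache.flink.api.scala._

    object IterationSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val initial: DataSet[Double] = env.fromElements(0.0)
        // iterate(10) embeds the loop in a single job graph: the cluster
        // schedules the operators once and reuses them for all ten
        // iterations, instead of submitting one job per iteration.
        val result = initial.iterate(10) { current =>
          current.map(_ + 1.0)
        }
        result.print() // sink that triggers execution of the whole job
      }
    }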

> On Dec 31, 2015, at 9:37 AM, Stephan Ewen <sewen@apache.org> wrote:
> 
> Concerning question (2):
> 
> DataSets in Flink are in most cases not materialized at all; they represent in-flight data as it is being streamed from one operation to the next (remember, Flink is streaming at its core). So even in a MapReduce-style program, the DataSet produced by the MapFunction never exists as a whole, but is continuously produced and streamed to the ReduceFunction.
> 
> The operator that executes the ReduceFunction materializes the data as part of its sorting operation. All materializing batch operations (sort / hash / cache / ...) can go out of core very reliably.
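
A minimal sketch of such a pipelined map-reduce in the Scala DataSet API (the data and functions are illustrative): the tuples produced by map are streamed record by record into the grouping/sorting of the reduce, which can spill to disk when the data does not fit in memory.

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    val numbers = env.fromCollection(1 to 1000000)
    // The DataSet produced by map never exists as a whole; its records
    // flow directly into the sort of the reduce operator below.
    val sums = numbers
      .map(n => (n % 10, n.toLong))
      .groupBy(0)
      .reduce((a, b) => (a._1, a._2 + b._2))
    sums.print()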
> 
> Greetings,
> Stephan
> 
> 
> 
> On Wed, Dec 30, 2015 at 4:45 AM, Sourav Mazumder <sourav.mazumder00@gmail.com> wrote:
> Hi Aljoscha and Chiwan,
> 
> Firstly thanks for the inputs.
> 
> Couple of follow ups -
> 
> 1. Based on Chiwan's explanation and the links, my understanding is that a potential performance difference between Spark and Flink (during iterative computation, such as building a model with a machine learning algorithm) may arise across iterations because of the overhead of starting a new set of tasks/operators. Other overheads would be the same, as both store the intermediate results in memory. Is this understanding correct?
> 
> 2. In the case of Flink, what happens if a DataSet needs to contain more data than the total memory available across all the slave nodes? Will it serialize the data to the disks of the respective slave nodes by default?
> 
> Regards,
> Sourav
> 
> 
> On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park <chiwanpark@apache.org> wrote:
> Hi Filip,
> 
> Spark also executes jobs lazily, but it is slightly different from Flink. Flink can lazily execute a whole job in cases where Spark cannot. One example is an iterative job.
> 
> In Spark, each stage of the iteration is submitted, scheduled as a separate job, and executed, because an action is called at the end of each iteration. In Flink, even if the job contains an iteration, the user submits only one job; the Flink cluster schedules and runs that job once.
> 
> Because of this difference, in Spark the user must make additional decisions, such as which RDDs should be cached or uncached.
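
For contrast, a hypothetical Spark-side version of such a loop might look like this (an illustrative sketch, not from the thread; sc is assumed to be an existing SparkContext): the action at the end of each pass submits a separate job, and the user manages caching explicitly.

    import org.apache.spark.SparkContext

    def iterationSketch(sc: SparkContext): Unit = {
      var current = sc.parallelize(Seq(0.0))
      for (_ <- 1 to 10) {
        current = current.map(_ + 1.0).cache() // user-managed caching
        current.count() // action: each iteration runs as its own job
      }
    }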
> 
> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer on SO [2] would be helpful for understanding these differences. :)
> 
> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
> [2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
> 
> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki <filipleczycki@gmail.com> wrote:
> >
> > Hi Aljoscha,
> >
> > Sorry for going a little off-topic, but I wanted to clarify whether my understanding is right. You said that "Contrary to Spark, a Flink job is executed lazily". However, as I read in the available sources, for example http://spark.apache.org/docs/latest/programming-guide.html, chapter "RDD Operations": "The transformations are only computed when an action requires a result to be returned to the driver program." To my understanding, Spark implements the same lazy execution principle as Flink: the job is only executed when a data sink/action/execute is called, and before that only an execution plan is built. Is that correct, or are there other significant differences between the Spark and Flink lazy execution approaches that I failed to grasp?
> >
> > Best regards,
> > Filip Łęczycki
> >
> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljoscha@apache.org>:
> > Hi Sourav,
> > you are right, in Flink the equivalent to an RDD would be a DataSet (or a DataStream if you are working with the streaming API).
> >
> > Contrary to Spark, a Flink job is executed lazily when ExecutionEnvironment.execute() is called. Only then does Flink build an executable program from the graph of transformations that was built by calling the transformation methods on the DataSet. That's why I called it lazy. The operations will also be automatically parallelized. The parallelism of operations can be configured in the cluster configuration (conf/flink-conf.yaml), on a per-job basis (ExecutionEnvironment.setParallelism(int)), or per operation, by calling setParallelism(int) on a DataSet.
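
A minimal sketch of the per-job and per-operator settings mentioned above (the parallelism values and data are arbitrary, for illustration only):

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // per-job default, overriding conf/flink-conf.yaml

    val counts = env
      .fromElements("a", "b", "a")
      .map(w => (w, 1))
      .groupBy(0)
      .sum(1)
      .setParallelism(2) // per-operator override for the aggregation

    counts.print() // sink; up to this point only the plan is built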
> >
> > (Above you can always replace DataSet by DataStream, the same explanations hold.)
> >
> > So, to get back to your question: yes, the operation of reading the file (or files in a directory) will be parallelized across several worker nodes based on the previously mentioned settings, as sketched below.
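
A short sketch of that (the path is illustrative): each parallel instance of the source reads its own split of the input, so the data is partitioned across the workers from the start rather than loaded onto one node.

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")
    println(lines.count()) // triggers execution; returns the line count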
> >
> > Let us know if you need more information.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <sourav.mazumder00@gmail.com> wrote:
> > Hi,
> >
> > I am new to Flink. Trying to understand some of the basics of Flink.
> >
> > What is the equivalent of Spark's RDD in Flink? In my understanding the closest thing is the DataSet API, but I wanted to reconfirm.
> >
> > Also, using the DataSet API, if I ingest a large volume of data (val lines : DataSet[String] = env.readTextFile(<some file path and name>)) which may not fit in a single slave node, will that data get automatically distributed in the memory of the other slave nodes?
> >
> > Regards,
> > Sourav
> >
> 
> Regards,
> Chiwan Park

Regards,
Chiwan Park

