flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: What is the equivalent of Spark RDD in Flink
Date Thu, 31 Dec 2015 00:37:15 GMT
Concerning question (2):

DataSets in Flink are in most cases not materialized at all; they
represent in-flight data as it is being streamed from one operation to the
next (remember, Flink is streaming at its core). So even in a MapReduce-style
program, the DataSet produced by the MapFunction never exists as a whole,
but is continuously produced and streamed to the ReduceFunction.

The operator that executes the ReduceFunction materializes the data as part
of its sorting operation. All materializing batch operations (sort / hash /
cache / ...) can go out of core very reliably.
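
As a rough illustration (a minimal Scala sketch, not code from this thread;
the input path is made up), a MapReduce-style job looks like the following.
The intermediate DataSet is pipelined into the reduce side rather than
materialized:

    import org.apache.flink.api.scala._

    object WordCount {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // The DataSet produced by flatMap is never held in memory as a whole;
        // its records are streamed straight into the grouping operator, which
        // sorts them (spilling to disk if necessary) before the aggregation.
        val counts = env.readTextFile("hdfs:///path/to/input")
          .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
          .map(word => (word, 1))
          .groupBy(0)
          .sum(1)

        counts.print()   // triggers execution of the whole pipeline
      }
    }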

Greetings,
Stephan



On Wed, Dec 30, 2015 at 4:45 AM, Sourav Mazumder <
sourav.mazumder00@gmail.com> wrote:

> Hi Aljoscha and Chiwan,
>
> Firstly thanks for the inputs.
>
> Couple of follow ups -
>
> 1. Based on Chiwan's explanation and the links, my understanding is that a
> potential performance difference between Spark and Flink (during iterative
> computation, such as building a model with a machine learning algorithm) may
> arise across two iterations because of the overhead of starting a new set
> of tasks/operators. Other overheads would be the same, as both store the
> intermediate results in memory. Is this understanding correct?
>
> 2. In the case of Flink, what happens if a DataSet needs to hold more data
> than the total memory available across all the slave nodes? Will it spill
> the data to the disks of the respective slave nodes by default?
>
> Regards,
> Sourav
>
>
> On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park <chiwanpark@apache.org>
> wrote:
>
>> Hi Filip,
>>
>> Spark also executes jobs lazily, but it is slightly different from Flink.
>> Flink can lazily execute a whole job in cases where Spark cannot. One
>> example is an iterative job.
>>
>> In Spark, each iteration is submitted, scheduled, and executed as a
>> separate job, because an action is called at the end of each iteration. In
>> Flink, even though the job contains an iteration, the user submits only one
>> job, and the Flink cluster schedules and runs that job once.
>>
>> Because of this difference, in Spark the user must decide more things,
>> such as which RDDs to cache or uncache.
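>>
>> As a rough sketch (not code from this thread; the iteration count and the
>> update step are arbitrary), a Flink bulk iteration is declared inside a
>> single program, and the whole loop is scheduled as one job:
>>
>>     import org.apache.flink.api.scala._
>>
>>     val env = ExecutionEnvironment.getExecutionEnvironment
>>
>>     // The loop below is part of one submitted job; Flink schedules it once,
>>     // instead of submitting one job per iteration as Spark does when an
>>     // action is called at the end of every iteration.
>>     val initial = env.fromElements(0.0)
>>     val result = initial.iterate(100) { current =>
>>       current.map(x => x + 1.0)   // stand-in for a real update step
>>     }
>>     result.print()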
>>
>> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer
>> on SO [2] are helpful for understanding these differences. :)
>>
>> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
>> [2]:
>> http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
>>
>> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki <filipleczycki@gmail.com>
>> wrote:
>> >
>> > Hi Aljoscha,
>> >
>> > Sorry for going a little off-topic, but I wanted to clarify whether my
>> understanding is right. You said that "Contrary to Spark, a Flink job is
>> executed lazily", however, as I read in the available sources, for example
>> http://spark.apache.org/docs/latest/programming-guide.html, chapter "RDD
>> operations": "The transformations are only computed when an action
>> requires a result to be returned to the driver program." To my
>> understanding, Spark implements the same lazy execution principle as Flink,
>> that is, the job is only executed when a data sink/action/execute is called,
>> and before that only an execution plan is built. Is that correct, or are
>> there other significant differences between the Spark and Flink lazy
>> execution approaches that I failed to grasp?
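>> >
>> > For instance (a minimal Spark sketch, only to illustrate the lazy
>> > evaluation I am asking about; the path is made up):
>> >
>> >     import org.apache.spark.SparkContext
>> >
>> >     val sc = new SparkContext("local[*]", "lazy-demo")
>> >     val lines = sc.textFile("hdfs:///path/to/input")  // nothing runs yet
>> >     val lengths = lines.map(_.length)                 // still only a plan
>> >     val total = lengths.reduce(_ + _)                 // action: a job runs here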
>> >
>> > Best regards,
>> > Filip Łęczycki
>> >
>> > Regards,
>> > Filip Łęczycki
>> >
>> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek <aljoscha@apache.org>:
>> > Hi Sourav,
>> > you are right, in Flink the equivalent to an RDD would be a DataSet (or
>> a DataStream if you are working with the streaming API).
>> >
>> > Contrary to Spark, a Flink job is executed lazily when
>> ExecutionEnvironment.execute() is called. Only then does Flink build an
>> executable program from the graph of transformations that was built by
>> calling the transformation methods on DataSet. That’s why I called it lazy.
>> The operations will also be automatically parallelized. The parallelism of
>> operations can either be configured in the cluster configuration
>> (conf/flink-conf.yaml), on a per-job basis
>> (ExecutionEnvironment.setParallelism(int)), or per operation, by calling
>> setParallelism(int) on a DataSet.
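>> >
>> > For example (a minimal Scala sketch, not code from this thread; the
>> > parallelism values and the input path are arbitrary):
>> >
>> >     import org.apache.flink.api.scala._
>> >
>> >     val env = ExecutionEnvironment.getExecutionEnvironment
>> >     env.setParallelism(8)                    // per-job default parallelism
>> >
>> >     val words = env.readTextFile("hdfs:///path/to/input")
>> >       .flatMap(_.split("\\W+"))
>> >       .setParallelism(16)                    // per-operator override
>> >
>> >     words.print()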
>> >
>> > (Above, you can always replace DataSet with DataStream; the same
>> explanations hold.)
>> >
>> > So, to get back to your question: yes, the operation of reading the
>> file (or files in a directory) will be parallelized across several worker
>> nodes based on the previously mentioned settings.
>> >
>> > Let us know if you need more information.
>> >
>> > Cheers,
>> > Aljoscha
>> >
>> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder <
>> sourav.mazumder00@gmail.com> wrote:
>> > Hi,
>> >
>> > I am new to Flink. Trying to understand some of the basics of Flink.
>> >
>> > What is the equivalent of Spark's RDD in Flink? In my understanding,
>> the closest thing is the DataSet API, but I wanted to reconfirm.
>> >
>> > Also, using the DataSet API, if I ingest a large volume of data (val lines :
>> DataSet[String] = env.readTextFile(<some file path and name>)), which may
>> not fit in a single slave node, will that data get automatically distributed
>> across the memory of the other slave nodes?
>> >
>> > Regards,
>> > Sourav
>> >
>>
>> Regards,
>> Chiwan Park
>>
>>
>>
>>
>
