Subject: Re: What is the equivalent of Spark RDD in Flink
From: Chiwan Park
Date: Thu, 31 Dec 2015 13:08:29 +0900
To: user@flink.apache.org

About question 1,

Scheduling an iterative job only once is one of the factors behind the performance difference. Dongwon's slides [1] are helpful on the other performance factors.

[1] http://flink-forward.org/?session=a-comparative-performance-evaluation-of-flink
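
As a rough sketch, a bulk iteration in the Scala DataSet API looks like the following (the iteration count and the step function here are made up). The whole loop is part of one job graph, so the cluster schedules it only once:

import org.apache.flink.api.scala._

object BulkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val initial: DataSet[Int] = env.fromElements(0)

    // The loop below becomes part of a single job graph; it is
    // scheduled and run as one job, not once per step.
    val result = initial.iterate(10) { current =>
      current.map(_ + 1)
    }

    result.print() // triggers execution of the single job
  }
}

Compare this with Spark, where each iteration of such a loop ends in an action and therefore in a separately scheduled job.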
> On Dec 31, 2015, at 9:37 AM, Stephan Ewen wrote:
>
> Concerning question (2):
>
> DataSets in Flink are in most cases not materialized at all; they represent in-flight data as it is streamed from one operation to the next (remember, Flink is streaming at its core). So even in a MapReduce-style program, the DataSet produced by the MapFunction never exists as a whole, but is continuously produced and streamed to the ReduceFunction.
>
> The operator that executes the ReduceFunction materializes the data as part of its sorting operation. All materializing batch operations (sort / hash / cache / ...) can go out of core very reliably.
>
> Greetings,
> Stephan
>
>
> On Wed, Dec 30, 2015 at 4:45 AM, Sourav Mazumder wrote:
> Hi Aljoscha and Chiwan,
>
> Firstly, thanks for the inputs.
>
> A couple of follow-ups:
>
> 1. Based on Chiwan's explanation and the links, my understanding is that a potential performance difference between Spark and Flink during iterative computation (like building a model with a machine learning algorithm) may arise across iterations because of the overhead of starting a new set of tasks/operators for each one. The other overheads would be the same, as both store the intermediate results in memory. Is this understanding correct?
>
> 2. In the case of Flink, what happens if a DataSet needs to hold more data than the total memory available across all the slave nodes? Will Flink spill it to the disks of the respective slave nodes by default?
>
> Regards,
> Sourav
>
>
> On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park wrote:
> Hi Filip,
>
> Spark also executes jobs lazily, but it is slightly different from Flink. Flink can lazily execute a whole job in cases where Spark cannot; one example is an iterative job.
>
> In Spark, each step of the iteration is submitted, scheduled, and executed as a separate job, because an action is called at the end of each iteration. In Flink, even if the job contains an iteration, the user submits only one job, and the Flink cluster schedules and runs it once.
>
> Because of this difference, in Spark the user must decide more things, such as which RDDs to cache or uncache.
>
> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer on Stack Overflow [2] are helpful for understanding these differences. :)
>
> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
> [2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
>
> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki wrote:
> >
> > Hi Aljoscha,
> >
> > Sorry for being a little off-topic, but I wanted to clarify whether my understanding is right. You said that "Contrary to Spark, a Flink job is executed lazily", however, as I read in the available sources, for example the chapter "RDD operations" of http://spark.apache.org/docs/latest/programming-guide.html: "The transformations are only computed when an action requires a result to be returned to the driver program." To my understanding, Spark implements the same lazy-execution principle as Flink, that is, the job is only executed when a data sink/action/execute is called, and before that only an execution plan is built. Is that correct, or are there other significant differences between the Spark and Flink lazy-execution approaches that I failed to grasp?
> >
> > Best regards,
> > Filip Łęczycki
> >
> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek:
> > Hi Sourav,
> > you are right, in Flink the equivalent of an RDD would be a DataSet (or a DataStream if you are working with the streaming API).
> >
> > Contrary to Spark, a Flink job is executed lazily when ExecutionEnvironment.execute() is called. Only then does Flink build an executable program from the graph of transformations that was built by calling the transformation methods on DataSet. That's why I called it lazy. The operations will also be automatically parallelized. The parallelism of operations can be configured in the cluster configuration (conf/flink-conf.yaml), on a per-job basis (ExecutionEnvironment.setParallelism(int)), or per operation, by calling setParallelism(int) on a DataSet.
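> >
> > As a rough sketch of those three levels (the file paths and the parallelism values here are made up):
> >
> > import org.apache.flink.api.scala._
> >
> > object ParallelismSketch {
> >   def main(args: Array[String]): Unit = {
> >     val env = ExecutionEnvironment.getExecutionEnvironment
> >     env.setParallelism(4) // per-job default
> >
> >     // Only the plan is built here; nothing runs yet.
> >     val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")
> >
> >     val counts = lines
> >       .flatMap(_.toLowerCase.split("\\W+"))
> >       .map((_, 1))
> >       .groupBy(0)
> >       .sum(1)
> >       .setParallelism(8) // per-operation override
> >
> >     counts.writeAsText("hdfs:///path/to/output")
> >     env.execute("parallelism sketch") // only now does the plan run
> >   }
> > }
> >
> > (The cluster configuration supplies the default when neither of the two calls above is used.)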
> >
> > (Above you can always replace DataSet with DataStream; the same explanations hold.)
> >
> > So, to get back to your question: yes, the operation of reading the file (or the files in a directory) will be parallelized across several worker nodes based on the previously mentioned settings.
> >
> > Let us know if you need more information.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder wrote:
> > Hi,
> >
> > I am new to Flink and trying to understand some of its basics.
> >
> > What is the equivalent of Spark's RDD in Flink? In my understanding the closest thing is the DataSet API, but I wanted to reconfirm.
> >
> > Also, using the DataSet API, if I ingest a large volume of data (val lines : DataSet[String] = env.readTextFile()), which may not fit in a single slave node, will that data get automatically distributed in the memory of the other slave nodes?
> >
> > Regards,
> > Sourav
>
> Regards,
> Chiwan Park

Regards,
Chiwan Park