Subject: Re: What is the equivalent of Spark RDD in Flink
From: Chiwan Park
Date: Thu, 31 Dec 2015 13:08:29 +0900
To: user@flink.apache.org

About question 1,

Scheduling an iterative job only once is one of the factors behind the performance difference. Dongwon's slides [1] are helpful on the other performance factors.

[1] http://flink-forward.org/?session=a-comparative-performance-evaluation-of-flink
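
As a rough sketch, a bulk iteration in the Scala DataSet API looks like the following (the iteration count and the step function here are made up). The whole loop is part of one job graph, so the cluster schedules it only once:

import org.apache.flink.api.scala._

object BulkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val initial: DataSet[Int] = env.fromElements(0)

    // The loop below becomes part of a single job graph; it is
    // scheduled and run as one job, not once per step.
    val result = initial.iterate(10) { current =>
      current.map(_ + 1)
    }

    result.print() // triggers execution of the single job
  }
}

Compare this with Spark, where each iteration of such a loop ends in an action and therefore in a separately scheduled job.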
> On Dec 31, 2015, at 9:37 AM, Stephan Ewen wrote:
>
> Concerning question (2):
>
> DataSets in Flink are in most cases not materialized at all; they represent in-flight data as it is streamed from one operation to the next (remember, Flink is streaming at its core). So even in a MapReduce-style program, the DataSet produced by the MapFunction never exists as a whole, but is continuously produced and streamed to the ReduceFunction.
>
> The operator that executes the ReduceFunction materializes the data as part of its sorting operation. All materializing batch operations (sort / hash / cache / ...) can go out of core very reliably.
>
> Greetings,
> Stephan
>
>
> On Wed, Dec 30, 2015 at 4:45 AM, Sourav Mazumder wrote:
> Hi Aljoscha and Chiwan,
>
> Firstly, thanks for the inputs.
>
> A couple of follow-ups:
>
> 1. Based on Chiwan's explanation and the links, my understanding is that a potential performance difference between Spark and Flink during iterative computation (like building a model with a machine learning algorithm) may arise across iterations because of the overhead of starting a new set of tasks/operators for each one. The other overheads would be the same, as both store the intermediate results in memory. Is this understanding correct?
>
> 2. In the case of Flink, what happens if a DataSet needs to hold more data than the total memory available across all the slave nodes? Will Flink spill it to the disks of the respective slave nodes by default?
>
> Regards,
> Sourav
>
>
> On Mon, Dec 28, 2015 at 4:13 PM, Chiwan Park wrote:
> Hi Filip,
>
> Spark also executes jobs lazily, but it is slightly different from Flink. Flink can lazily execute a whole job in cases where Spark cannot; one example is an iterative job.
>
> In Spark, each step of the iteration is submitted, scheduled, and executed as a separate job, because an action is called at the end of each iteration. In Flink, even if the job contains an iteration, the user submits only one job, and the Flink cluster schedules and runs it once.
>
> Because of this difference, in Spark the user must decide more things, such as which RDDs to cache or uncache.
>
> Pages 22 and 23 of the ApacheCon EU 2014 slides [1] and Fabian's answer on Stack Overflow [2] are helpful for understanding these differences. :)
>
> [1]: http://www.slideshare.net/GyulaFra/flink-apachecon
> [2]: http://stackoverflow.com/questions/29780747/apache-flink-vs-apache-spark-as-platforms-for-large-scale-machine-learning
>
> > On Dec 29, 2015, at 1:35 AM, Filip Łęczycki wrote:
> >
> > Hi Aljoscha,
> >
> > Sorry for being a little off-topic, but I wanted to clarify whether my understanding is right. You said that "Contrary to Spark, a Flink job is executed lazily", however, as I read in the available sources, for example the chapter "RDD operations" of http://spark.apache.org/docs/latest/programming-guide.html: "The transformations are only computed when an action requires a result to be returned to the driver program." To my understanding, Spark implements the same lazy-execution principle as Flink, that is, the job is only executed when a data sink/action/execute is called, and before that only an execution plan is built. Is that correct, or are there other significant differences between the Spark and Flink lazy-execution approaches that I failed to grasp?
> >
> > Best regards,
> > Filip Łęczycki
> >
> > 2015-12-25 10:17 GMT+01:00 Aljoscha Krettek:
> > Hi Sourav,
> > you are right, in Flink the equivalent of an RDD would be a DataSet (or a DataStream if you are working with the streaming API).
> >
> > Contrary to Spark, a Flink job is executed lazily when ExecutionEnvironment.execute() is called. Only then does Flink build an executable program from the graph of transformations that was built by calling the transformation methods on DataSet. That's why I called it lazy. The operations will also be automatically parallelized. The parallelism of operations can be configured in the cluster configuration (conf/flink-conf.yaml), on a per-job basis (ExecutionEnvironment.setParallelism(int)), or per operation, by calling setParallelism(int) on a DataSet.
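> >
> > As a rough sketch of those three levels (the file paths and the parallelism values here are made up):
> >
> > import org.apache.flink.api.scala._
> >
> > object ParallelismSketch {
> >   def main(args: Array[String]): Unit = {
> >     val env = ExecutionEnvironment.getExecutionEnvironment
> >     env.setParallelism(4) // per-job default
> >
> >     // Only the plan is built here; nothing runs yet.
> >     val lines: DataSet[String] = env.readTextFile("hdfs:///path/to/input")
> >
> >     val counts = lines
> >       .flatMap(_.toLowerCase.split("\\W+"))
> >       .map((_, 1))
> >       .groupBy(0)
> >       .sum(1)
> >       .setParallelism(8) // per-operation override
> >
> >     counts.writeAsText("hdfs:///path/to/output")
> >     env.execute("parallelism sketch") // only now does the plan run
> >   }
> > }
> >
> > (The cluster configuration supplies the default when neither of the two calls above is used.)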
> >
> > (Above you can always replace DataSet with DataStream; the same explanations hold.)
> >
> > So, to get back to your question: yes, the operation of reading the file (or the files in a directory) will be parallelized across several worker nodes based on the previously mentioned settings.
> >
> > Let us know if you need more information.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 24 Dec 2015 at 16:49 Sourav Mazumder wrote:
> > Hi,
> >
> > I am new to Flink and trying to understand some of its basics.
> >
> > What is the equivalent of Spark's RDD in Flink? In my understanding the closest thing is the DataSet API, but I wanted to reconfirm.
> >
> > Also, using the DataSet API, if I ingest a large volume of data (val lines : DataSet[String] = env.readTextFile()), which may not fit in a single slave node, will that data get automatically distributed in the memory of the other slave nodes?
> >
> > Regards,
> > Sourav
>
> Regards,
> Chiwan Park

Regards,
Chiwan Park