Subject: Re: spark vs flink batch performance
From: CPC
Date: Fri, 18 Nov 2016 10:26:50 +0300
To: user@flink.apache.org

Hi all,

In the meantime I have three workers. Any thoughts about improving Flink
performance?

Thank you...

On Nov 17, 2016 00:38, "CPC" wrote:

> Hi all,
>
> I am trying to compare Spark and Flink batch performance. In my test I am
> using ratings.csv from the
> http://files.grouplens.org/datasets/movielens/ml-latest.zip dataset. I also
> concatenated ratings.csv 16 times to increase the dataset size (390465536
> records in total, almost 10 GB). I am reading from Google Cloud Storage with
> the gcs-connector, and the file schema is: userId,movieId,rating,timestamp.
> Basically I am calculating the average rating per movie.
>
> Code for Flink (I tested CombineHint.HASH and CombineHint.SORT):
>
>   case class Rating(userID: String, movieID: String, rating: Double, date: Timestamp)
>
>   def parseRating(line: String): Rating = {
>     val arr = line.split(",")
>     Rating(arr(0), arr(1), arr(2).toDouble, new Timestamp(arr(3).toLong * 1000))
>   }
>
>   val ratings: DataSet[Rating] =
>     env.readTextFile("gs://cpcflink/wikistream/ratingsheadless16x.csv")
>       .map(a => parseRating(a))
>   ratings
>     .map(i => (i.movieID, 1, i.rating))
>     .groupBy(0)
>     .reduce((l, r) => (l._1, l._2 + r._2, l._3 + r._3), CombineHint.HASH)
>     .map(i => (i._1, i._3 / i._2))
>     .collect()
>     .sortBy(_._1)
>     .sortBy(_._2)(Ordering.Double.reverse)
>     .take(10)
>
> With CombineHint.HASH: 3m49s. With CombineHint.SORT: 5m9s.
>
> Code for Spark (I tested reduceByKey and reduceByKeyLocally):
>
>   case class Rating(userID: String, movieID: String, rating: Double, date: Timestamp)
>
>   def parseRating(line: String): Rating = {
>     val arr = line.split(",")
>     Rating(arr(0), arr(1), arr(2).toDouble, new Timestamp(arr(3).toLong * 1000))
>   }
>
>   val conf = new SparkConf().setAppName("Simple Application")
>   val sc = new SparkContext(conf)
>   val keyed: RDD[(String, (Int, Double))] =
>     sc.textFile("gs://cpcflink/wikistream/ratingsheadless16x.csv")
>       .map(parseRating)
>       .map(r => (r.movieID, (1, r.rating)))
>   keyed
>     .reduceByKey((l, r) => (l._1 + r._1, l._2 + r._2))
>     .mapValues(i => i._2 / i._1)
>     .collect
>     .sortBy(_._1)
>     .sortBy(a => a._2)(Ordering.Double.reverse)
>     .take(10)
>     .foreach(println)
>
> With reduceByKeyLocally: 2.9 minutes (about 2m54s). With reduceByKey: 3.1
> minutes (about 3m6s).
>
> Machine config on Google Cloud:
> taskmanager/sparkmaster: n1-standard-1 (1 vCPU, 3.75 GB memory)
> jobmanager/sparkworkers: n1-standard-2 (2 vCPUs, 7.5 GB memory)
> Java version: JDK 8u102
> Flink: 1.1.3
> Spark: 2.0.2
>
> I also attached flink-conf.yaml. Although it is not such a big difference,
> there is about a 40% performance gap between Spark and Flink. Is there
> something I am doing wrong? If not, how can I fine-tune Flink, or is it
> normal for Spark to have better batch performance?
>
> Thank you in advance...
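
(For anyone who wants to reproduce the numbers: below is a minimal,
self-contained sketch of the Flink job from the quoted mail, with the imports,
object wrapper and ExecutionEnvironment setup that the snippet above omits.
The GCS path, the parsing logic and the reduce-with-CombineHint call are taken
from the snippet as-is; the object name and the surrounding boilerplate are
assumptions about how the job was packaged.)

  import java.sql.Timestamp

  import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint
  import org.apache.flink.api.scala._

  // Object/class name is arbitrary; only the job logic matches the quoted snippet.
  object MovieAverages {

    case class Rating(userID: String, movieID: String, rating: Double, date: Timestamp)

    def parseRating(line: String): Rating = {
      val arr = line.split(",")
      Rating(arr(0), arr(1), arr(2).toDouble, new Timestamp(arr(3).toLong * 1000))
    }

    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment

      // Read the headerless ratings file from GCS and parse each line.
      val ratings: DataSet[Rating] =
        env.readTextFile("gs://cpcflink/wikistream/ratingsheadless16x.csv")
          .map(a => parseRating(a))

      // Emit (movieID, count, ratingSum), reduce per movie, then compute the average.
      val top10 = ratings
        .map(i => (i.movieID, 1, i.rating))
        .groupBy(0)
        .reduce((l, r) => (l._1, l._2 + r._2, l._3 + r._3), CombineHint.HASH)
        .map(i => (i._1, i._3 / i._2))
        .collect()
        .sortBy(_._1)
        .sortBy(_._2)(Ordering.Double.reverse)
        .take(10)

      top10.foreach(println)
    }
  }

Packaged as a fat jar, this should be submittable with bin/flink run against
the same cluster described above.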