From: Gustavo Arjones <garjones@socialmetrix.com>
Subject: Re: Poor performance writing to S3
Date: Wed, 1 Oct 2014 15:39:32 -0300
To: user@spark.apache.org

Hi,
I found the answer to my problem, and am writing it up to keep as a KB entry.

It turns out the problem wasn't related to S3 performance: my SOURCE was not fast enough. Because of Spark's lazy evaluation, the stage shown on the dashboard was "saveAsTextFile at FacebookProcessor.scala:46" rather than the load method that was actually slow.

When I ran count() on my dataset before trying to save it to S3, I could pinpoint the input bottleneck.
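For the archive, a minimal sketch of that check (the SparkContext setup, app name, and input path are illustrative placeholders, not the real code from FacebookProcessor.scala):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("input-probe"))

    // Illustrative stand-in for the real source read
    val postsStats = sc.textFile("s3n://bucket/path/to/input")

    // count() is an action, so it forces the lazy read to actually run;
    // timing it separates the read cost from the later saveAsTextFile cost.
    val t0 = System.nanoTime()
    val n  = postsStats.count()
    println(f"read $n%d records in ${(System.nanoTime() - t0) / 1e9}%.1f s")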
- gustavo

On Sep 30, 2014, at 10:03 PM, Gustavo Arjones <garjones@socialmetrix.com> wrote:

> Hi,
> I'm trying to save about a million lines of statistics data, something like:
>
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404691200  1404691200  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404694800  1404694800  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404698400  1404698400  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404702000  1404702000  1402316275  46  0  0  7  0  0  0
>
> I am using the standard saveAsTextFile with an optional codec (GzipCodec):
>
>     postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])
>
> The resulting task takes really long, e.g. 3 hours to save 2 GB of data. I found some references and blog posts about increasing the number of RDD partitions to improve processing when READING from the source.
>
> Would the opposite improve WRITEs? That is, if I reduce the partitioning level, can I avoid the small-files problem?
> Is it possible that GzipCodec is affecting the parallelism level and reducing overall performance?
>
> I have 4 m1.xlarge nodes (1 master + 3 workers) on EC2, running Spark 1.1.0 in standalone mode, launched with the spark-ec2 script.
>
> Thanks a lot!
> - gustavo
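For completeness, on the partitioning question quoted above: each RDD partition becomes one output file, so coalescing to a smaller partition count before the save is the usual way to avoid lots of tiny .gz files. Since gzip is not splittable, fewer, larger files are also friendlier to read back later. A minimal sketch; the target of 32 partitions is an arbitrary placeholder to tune, and the S3 path is as elided in the original message:

    import org.apache.hadoop.io.compress.GzipCodec

    // One output file is written per partition; coalesce(32) caps the
    // file count without a full shuffle. 32 is a placeholder value to
    // tune against the data size and cluster.
    postsStats
      .coalesce(32)
      .saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])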