spark-user mailing list archives

From Gustavo Arjones <garjo...@socialmetrix.com>
Subject Re: Poor performance writing to S3
Date Wed, 01 Oct 2014 18:39:32 GMT
Hi,
I found the answer to my problem, and I'm just writing it up to keep it as a KB entry.

Turns out the problem wasn't related to S3 performance; it was because my SOURCE was not fast enough. Due to the lazy nature of Spark, what I saw on the dashboard was saveAsTextFile at FacebookProcessor.scala:46 instead of the load() method.

When I ran count() on my dataset before trying to save it to S3, I could pinpoint the input bottleneck.
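
A rough sketch of the idea (loadPosts() here is just a placeholder for the real source, and the cache() is optional, only to avoid recomputing the input when saving):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("FacebookProcessor"))
    val postsStats = loadPosts(sc)                  // lazy: nothing has run yet
    postsStats.cache()                              // keep the materialized rows around for the save
    println(s"input rows: ${postsStats.count()}")   // the slow SOURCE shows up in this stage
    postsStats.saveAsTextFile("s3n://smx-spark/...../raw_data")  // this stage now measures mostly the write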

- gustavo


On Sep 30, 2014, at 10:03 PM, Gustavo Arjones <garjones@socialmetrix.com> wrote:

> Hi,
> I'm trying to save about a million lines of statistics data, something like:
> 
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404691200  1404691200  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404694800  1404694800  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404698400  1404698400  1402316275  46  0  0  7  0  0  0
> 233815212529_10152316612422530  233815212529_10152316612422530  1328569332  1404702000  1404702000  1402316275  46  0  0  7  0  0  0
> 
> Using the standard saveAsTextFile with an optional codec (GzipCodec)
> 
>     postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])
> 
> The resulting task is taking really long, i.e. 3 hours to save 2 GB of data. I found some references and blog posts about increasing RDD partitions to improve processing when READING from the source.
> 
> Would the opposite improve the WRITE operation? I mean, if I reduce the partitioning level, can I avoid the small-files problem?
> Is it possible that GzipCodec is affecting the parallelism level and reducing overall performance?
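> 
> For reference, by reducing the partitioning level I mean something roughly like this (coalesce and the partition count are just my guess at how to do it):
> 
>     postsStats.coalesce(16).saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])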
> 
> I have 4 m1.xlarge nodes (1 master + 3 workers) on EC2, in standalone mode, launched using the spark-ec2 script with Spark 1.1.0.
> 
> Thanks a lot!
> - gustavo

