Hi,
I found the answer to my problem, and just writing to keep it as KB.
Turns out the problem wasn't related to S3 performance; it was that my SOURCE was not fast
enough. Due to the lazy nature of Spark, what I saw on the dashboard was saveAsTextFile at
FacebookProcessor.scala:46 instead of the load() method.
When I ran count() on my dataset before trying to save it to S3, I could see the input
bottleneck.
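For the record, a minimal sketch of the diagnosis. The input path, output path, and partition count below are hypothetical placeholders, not the real ones from my job; the API calls are the standard Spark 1.1 RDD operations:

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

object SaveStatsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveStatsSketch"))

    // Hypothetical source; in my case this was the slow input, not S3.
    val postsStats = sc.textFile("s3n://my-bucket/input").cache()

    // Force full evaluation of the input BEFORE writing. Because Spark is
    // lazy, without this the web UI charges all the time spent reading the
    // slow source to the saveAsTextFile stage, which is misleading.
    println(s"input rows: ${postsStats.count()}")

    // With the RDD cached, the save now measures only the write itself.
    // coalesce() (hypothetical partition count) also keeps the number of
    // output gzip files down.
    postsStats
      .coalesce(16)
      .saveAsTextFile("s3n://my-bucket/output", classOf[GzipCodec])

    sc.stop()
  }
}
```

The cache() matters: without it, count() and saveAsTextFile each re-read the source, so the save stage would still include the input time.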
- gustavo
On Sep 30, 2014, at 10:03 PM, Gustavo Arjones <garjones@socialmetrix.com> wrote:
> Hi,
> I’m trying to save about a million lines of statistics data, something like:
>
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404691200 1404691200 1402316275 46 0 0 7 0 0 0
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404694800 1404694800 1402316275 46 0 0 7 0 0 0
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404698400 1404698400 1402316275 46 0 0 7 0 0 0
> 233815212529_10152316612422530 233815212529_10152316612422530 1328569332 1404702000 1404702000 1402316275 46 0 0 7 0 0 0
>
> I’m using the standard saveAsTextFile with an optional codec (GzipCodec):
>
> postsStats.saveAsTextFile(s"s3n://smx-spark/...../raw_data", classOf[GzipCodec])
>
> The resulting task is taking really long, i.e. 3 hours to save 2 GB of data. I found
> some references and blog posts about increasing the number of RDD partitions to improve
> processing when READING from the source.
>
> Would the opposite improve the WRITE operation? I mean, if I reduce the partitioning
> level, can I avoid the small-files problem?
> Is it possible that the GzipCodec is affecting the parallelism level and reducing the overall performance?
>
> I have 4 m1.xlarge nodes (1 master + 3 workers) on EC2, in standalone mode, launched
> using the spark-ec2 script with Spark 1.1.0.
>
> Thanks a lot!
> - gustavo