spark-user mailing list archives

From Andre Kuhnen <andrekuh...@gmail.com>
Subject Spark slow after first stage (big dataset)
Date Fri, 01 Nov 2013 10:52:22 GMT
Hello,  I am trying to understand how to solve a performance problem with
my Spark job.


Here is the algorithm

RDD.flatMap( func(createSomeObject)).distinct.reduceByKey(...)....
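For reference, the shape of this job can be sketched with plain Scala collections as a stand-in for the RDD API (the data and names here are made up, and `groupBy` plus a sum plays the role of `reduceByKey`). The point of the sketch is that both `distinct` and `reduceByKey` are shuffle boundaries in Spark, so this pipeline moves the data set across the network twice:

```scala
// Minimal stand-in for the flatMap -> distinct -> reduceByKey pipeline,
// using plain Scala collections instead of RDDs (hypothetical data).
object PipelineSketch {
  def runPipeline(records: Seq[String]): Map[String, Int] =
    records
      .flatMap(_.split(" "))        // flatMap: one record expands into many elements
      .map(word => (word, 1))       // build (key, value) pairs
      .distinct                     // distinct: first shuffle boundary in Spark
      .groupBy { case (k, _) => k } // reduceByKey: second shuffle boundary
      .map { case (k, vs) => (k, vs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    println(runPipeline(Seq("a b", "b c", "a b")))
}
```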


When the data set is big enough to create around 2000 tasks, once the job
reaches the second stage (the reduceByKey) the CPUs of ALL machines spend
most of their time waiting, and besides that I can see a constant flow of
traffic on the network.

I am using the spark-ec2 script to deploy the cluster on Amazon.


The strange thing is that when I have fewer than 300 tasks, the reduceByKey
stage is faster than the first stage (distinct), so I am trying to figure
out why it gets really slow (I mean slower than the distinct stage) with a
huge data set.

Thanks  a lot
