spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Patrick Wendell <>
Subject Re: How fast would you expect shuffle serialize to be?
Date Wed, 30 Apr 2014 04:56:47 GMT
Is this the serialization throughput per task or the serialization
throughput for all the tasks?

On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond <> wrote:
> Hi
>         I am running a WordCount program which count words from HDFS, and I noticed that
the serializer part of code takes a lot of CPU time. On a 16core/32thread node, the total
throughput is around 50MB/s by JavaSerializer, and if I switching to KryoSerializer, it doubles
to around 100-150MB/s. ( I have 12 disks per node and files scatter across disks, so HDFS
BW is not a problem)
>         And I also notice that, in this case, the object to write is (String, Int), if
I try some case with (int, int), the throughput will be 2-3x faster further.
>         So, in my Wordcount case, the bottleneck is CPU ( cause if with shuffle compress
on, the 150MB/s data bandwidth in input side, will usually lead to around 50MB/s shuffle data)
>         This serialize BW looks somehow too low , so I am wondering, what's BW you observe
in your case? Does this throughput sounds reasonable to you? If not, anything might possible
need to be examined in my case?
> Best Regards,
> Raymond Liu

View raw message