spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances
Date Wed, 03 Sep 2014 00:57:52 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14119089#comment-14119089 ]

Josh Rosen commented on SPARK-3358:
-----------------------------------

Credit where it's due: Davies pointed out the potential for this problem in the original PR:
https://github.com/apache/spark/pull/1680#issuecomment-50721351

The Redis team did their own benchmarking on this:
http://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure
(archived copy, since their site may be down / slow right now:
https://web.archive.org/web/20140529181436/http://redislabs.com/blog/testing-fork-time-on-awsxen-infrastructure).

Based on those results, and updated numbers at http://redislabs.com/blog/benchmarking-the-new-aws-m3-instances-with-redis,
it looks like HVM AMIs don't have this problem.  I'm going to try running a similar microbenchmark
on m3.xlarge with the spark-ec2 HVM AMI to see if that improves performance.  If so, we should
consider changing from PVM to HVM for those instance types.
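
A rough sketch of what such a microbenchmark could look like is below. It is not the script
from the issue description: it assumes Python 3 (for time.perf_counter), the helper name
time_forks is made up for illustration, and unlike the original loop it reaps each child and
reports the mean wall-clock time spent inside fork() in the parent.

```python
import os
import time


def time_forks(n=1000):
    """Fork n short-lived children; return mean seconds per fork() call in the parent.

    time_forks is a hypothetical helper name, not part of PySpark.
    """
    total = 0.0
    for _ in range(n):
        start = time.perf_counter()
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately, skipping interpreter cleanup handlers.
            os._exit(0)
        # Parent: account only for the time spent in the fork() call itself.
        total += time.perf_counter() - start
        # Reap the child so zombie processes do not accumulate.
        os.waitpid(pid, 0)
    return total / n


if __name__ == "__main__":
    print("mean fork() latency: %.3f ms" % (time_forks() * 1e3))
```

Running the same script on a PVM and an HVM instance of the same type should give directly
comparable per-fork numbers, without child startup or exit time mixed in.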

> PySpark worker fork()ing performance regression in m3.* / PVM instances
> -----------------------------------------------------------------------
>
>                 Key: SPARK-3358
>                 URL: https://issues.apache.org/jira/browse/SPARK-3358
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.1.0
>         Environment: m3.* instances on EC2
>            Reporter: Josh Rosen
>
> SPARK-2764 (and some follow-up commits) simplified PySpark's worker process structure
> by removing an intermediate pool of processes forked by daemon.py.  Previously, daemon.py
> forked a fixed-size pool of processes that shared a socket and handled worker launch requests
> from Java.  After my patch, this intermediate pool was removed and launch requests are handled
> directly in daemon.py.
> Unfortunately, this seems to have increased PySpark task launch latency when running
> on m3.* instances in EC2.  Most of this difference can be attributed to m3 instances'
> more expensive fork() system calls.  I tried the following microbenchmark on m3.xlarge and
> r3.xlarge instances:
> {code}
> import os
> for x in range(1000):
>   if os.fork() == 0:
>     exit()
> {code}
> On the r3.xlarge instance:
> {code}
> real	0m0.761s
> user	0m0.008s
> sys	0m0.144s
> {code}
> And on m3.xlarge:
> {code}
> real    0m1.699s
> user    0m0.012s
> sys     0m1.008s
> {code}
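> I think this is due to HVM and PVM EC2 instances using different virtualization technologies
> with different fork() costs: the numbers above work out to roughly 1.0 ms of system time per
> fork() on m3.xlarge versus about 0.14 ms on r3.xlarge, a difference of about 7x.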
> It may be the case that this performance difference only appears in certain microbenchmarks
> and is masked by other performance improvements in PySpark, such as improvements to large
> group-bys.  I'm in the process of re-running spark-perf benchmarks on m3 instances in order
> to confirm whether this impacts more realistic jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

