Date: Wed, 3 Sep 2014 20:07:52 +0000 (UTC)
From: "Josh Rosen (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-3358) PySpark worker fork()ing performance regression in m3.* / PVM instances

    [ https://issues.apache.org/jira/browse/SPARK-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120354#comment-14120354 ]

Josh Rosen commented on SPARK-3358:
-----------------------------------

Agreed. Long term, I think it would be better to address the underlying reasons why we need to fork so many processes.

> PySpark worker fork()ing performance regression in m3.* / PVM instances
> ------------------------------------------------------------------------
>
>                 Key: SPARK-3358
>                 URL: https://issues.apache.org/jira/browse/SPARK-3358
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.1.0
>         Environment: m3.* instances on EC2
>            Reporter: Josh Rosen
>
> SPARK-2764 (and some follow-up commits) simplified PySpark's worker process structure by removing an intermediate pool of processes forked by daemon.py. Previously, daemon.py forked a fixed-size pool of processes that shared a socket and handled worker launch requests from Java. After my patch, this intermediate pool was removed and launch requests are handled directly in daemon.py.
>
> Unfortunately, this seems to have increased PySpark task launch latency when running on m3.* class instances in EC2. Most of this difference can be attributed to m3 instances' more expensive fork() system calls. I tried the following microbenchmark on m3.xlarge and r3.xlarge instances:
> {code}
> import os
>
> for x in range(1000):
>     if os.fork() == 0:
>         exit()
> {code}
> On the r3.xlarge instance:
> {code}
> real    0m0.761s
> user    0m0.008s
> sys     0m0.144s
> {code}
> And on m3.xlarge:
> {code}
> real    0m1.699s
> user    0m0.012s
> sys     0m1.008s
> {code}
> I think this is because HVM and PVM EC2 instances use different virtualization technologies with different fork() costs.
>
> It may be that this performance difference only appears in certain microbenchmarks and is masked by other performance improvements in PySpark, such as improvements to large group-bys. I'm in the process of re-running the spark-perf benchmarks on m3 instances to confirm whether this impacts more realistic jobs.
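
To make the structural change described above concrete, here is a minimal, self-contained Python sketch contrasting the two launch strategies: a prefork pool (the pre-SPARK-2764 structure, where fork() is paid once per pool member at startup) versus fork-per-launch-request (where the expensive PVM fork() is paid on every task launch). This is illustrative only, not the actual daemon.py code; NUM_WORKERS, handle_request(), and the socket setup are hypothetical placeholders.

{code}
# Illustrative sketch only -- not the real PySpark daemon.py.
# NUM_WORKERS, handle_request(), and the socket setup are hypothetical.
import os
import socket

NUM_WORKERS = 4  # hypothetical fixed pool size


def handle_request(conn):
    # Placeholder for real worker logic (read a launch request, run it, reply).
    conn.sendall(b"ok")
    conn.close()


def prefork_pool(server):
    # Pre-SPARK-2764 style: fork a fixed pool once at startup; each child
    # accepts on the shared listening socket, so no fork() per request.
    for _ in range(NUM_WORKERS):
        if os.fork() == 0:  # child process
            while True:
                conn, _ = server.accept()
                handle_request(conn)
    # The parent would normally monitor / wait for the children here.


def fork_per_request(server):
    # Post-SPARK-2764 style: the daemon accepts each launch request itself
    # and pays one fork() per request, which is where the slow PVM fork()
    # shows up in task launch latency.
    while True:
        conn, _ = server.accept()
        if os.fork() == 0:  # child handles this one request, then exits
            handle_request(conn)
            os._exit(0)
        conn.close()  # parent closes its copy and goes back to accepting


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # ephemeral port, just for the sketch
    server.listen(16)
    fork_per_request(server)  # or prefork_pool(server)
{code}

For reference, the real/user/sys figures quoted above look like output from the shell's time builtin run against the microbenchmark saved as a script (e.g. time python fork_bench.py); that invocation is an assumption, not something stated in the report.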
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org