Subject: python.worker.memory parameter
From: Connor Zanin
To: user
Date: Tue, 27 Oct 2015 21:29:38 -0400

Hi all,

I am running a simple word count job on a cluster of 4 nodes (24 cores per node). I am varying two parameters in the configuration: spark.python.worker.memory and the number of partitions in the RDD. My job is written in Python.
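
For concreteness, the kind of experiment described above could be sketched as follows. This is not the actual job: the function names, the app name, and the input path argument are hypothetical, and the sketch assumes the partition count is controlled through `minPartitions` and `numPartitions` rather than some other mechanism.

```python
import itertools
import time
from operator import add


def parameter_grid(worker_memories, partition_counts):
    """All (memory, partitions) combinations to time, e.g. ("512m", 96)."""
    return list(itertools.product(worker_memories, partition_counts))


def timed_word_count(input_path, worker_memory, num_partitions):
    """Run one word count and return its wall-clock time in seconds."""
    # Imported lazily so parameter_grid() works without a Spark install.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("wordcount-memory-sweep")  # hypothetical app name
            .set("spark.python.worker.memory", worker_memory))
    sc = SparkContext(conf=conf)
    try:
        start = time.time()
        counts = (sc.textFile(input_path, minPartitions=num_partitions)
                  .flatMap(lambda line: line.split())
                  .map(lambda word: (word, 1))
                  .reduceByKey(add, numPartitions=num_partitions))
        counts.count()  # action that forces the shuffle to run
        return time.time() - start
    finally:
        sc.stop()  # release the context so the next run can set new config
```

Each run creates a fresh SparkContext because spark.python.worker.memory is read from the configuration when the context starts; calling `timed_word_count` once per entry of `parameter_grid(...)` gives one timing per combination.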

I am observing a discontinuity in the run time of the job when spark.python.worker.memory is increased past a threshold. Unfortunately, I am having trouble understanding exactly what this parameter does inside Spark and how it changes Spark's behavior to create this discontinuity.

The documentation describes this parameter as "Amount of memory to use per python worker process during aggregation," but I find this vague (or I do not know enough Spark terminology to know what it means).
I have been pointed to the source code in the past, specifically the shuffle.py file where _spill() appears.
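
The name `_spill()` suggests an aggregation that is held in memory until some limit is reached and is then written out. The toy model below is one way to picture that reading. It is an assumption-laden sketch, not PySpark's code: the class and function names are hypothetical, the limit is a number of distinct keys rather than bytes, and "spilling" appends to a list instead of writing to disk.

```python
from collections import Counter


class ToySpillingCounter:
    """Counts words in memory and 'spills' when a size limit is exceeded.

    Toy stand-in only: the limit is a key count, not bytes, and each spill
    goes to an in-memory list that stands in for one spill file.
    """

    def __init__(self, max_keys_in_memory):
        self.max_keys = max_keys_in_memory
        self.in_memory = Counter()
        self.spilled = []

    def add(self, word):
        self.in_memory[word] += 1
        if len(self.in_memory) > self.max_keys:
            self._spill()

    def _spill(self):
        # Set the current partial counts aside and start with empty memory.
        self.spilled.append(self.in_memory)
        self.in_memory = Counter()

    def result(self):
        # Merge what is still in memory with every spilled partial count.
        total = Counter(self.in_memory)
        for part in self.spilled:
            total.update(part)
        return total


def count_spills(words, max_keys_in_memory):
    """Return (number of spills, final word counts) for one limit."""
    counter = ToySpillingCounter(max_keys_in_memory)
    for word in words:
        counter.add(word)
    return len(counter.spilled), counter.result()
```

In this model the final counts are the same for every limit, but the number of spills falls to zero as soon as the limit is at least the number of distinct keys. If real spills are expensive, a step like that would show up as a jump in run time, which is the kind of behavior I am trying to confirm or rule out.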

Can anyone explain how this parameter behaves or point me to more descriptive documentation? Thanks!

--
Regards,

Connor Zanin
Computer Science
University of Delaware