Mailing-List: contact user-help@spark.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
Date: Tue, 27 Oct 2015 21:29:38 -0400
Message-ID: 
 <CAJVXJFfVRJfvP9_uGu6QxAh-NYobpq4sAsYkhzu=Pa6o_YJ6Uw@mail.gmail.com>
Subject: python.worker.memory parameter
From: Connor Zanin <cnnrznn@udel.edu>
To: user <user@spark.apache.org>
Content-Type: multipart/alternative; boundary=001a11c38720c18978052320202f

--001a11c38720c18978052320202f
Content-Type: text/plain; charset=UTF-8

Hi all,

I am running a simple word count job on a cluster of 4 nodes (24 cores per
node). I am varying two parameters in the configuration:
spark.python.worker.memory and the number of partitions in the RDD. My job
is written in Python.
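
For concreteness, a job of this shape could be sketched as below. This is
only an illustration under stated assumptions, not the actual script: the
helper names (word_count_settings, run_word_count), the input_path argument,
and the default values ("512m", 96 partitions, i.e. 4 nodes x 24 cores) are
all hypothetical. The only two knobs are the ones named above.

```python
def word_count_settings(worker_memory="512m", num_partitions=96):
    """Collect the two values varied between runs (defaults are illustrative)."""
    return {
        "spark.python.worker.memory": worker_memory,
        "num_partitions": num_partitions,
    }


def run_word_count(input_path, settings):
    """Run a basic word count with the given settings (requires PySpark)."""
    # Imported lazily so the settings helper works without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("wordcount")
        .set("spark.python.worker.memory",
             settings["spark.python.worker.memory"])
    )
    sc = SparkContext(conf=conf)
    try:
        n = settings["num_partitions"]
        counts = (
            sc.textFile(input_path, minPartitions=n)
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            # reduceByKey is the aggregation step of the job.
            .reduceByKey(lambda a, b: a + b, numPartitions=n)
        )
        return counts.collect()
    finally:
        sc.stop()
```

A run would then look like run_word_count(path, word_count_settings("1g", 48)),
repeated over a grid of memory values and partition counts.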

I am observing a discontinuity in the run time of the job when
spark.python.worker.memory is increased past a threshold. Unfortunately, I
am having trouble understanding exactly what this parameter does inside
Spark and how it changes Spark's behavior to create this discontinuity.

The documentation describes this parameter as "Amount of memory to use per
python worker process during aggregation," but I find this vague (or I do
not know enough Spark terminology to know what it means).

I have been pointed to the source code in the past, specifically the
shuffle.py file where _spill() appears.

Can anyone explain how this parameter behaves or point me to more
descriptive documentation? Thanks!

-- 
Regards,

Connor Zanin
Computer Science
University of Delaware

--001a11c38720c18978052320202f--