Subject: Re: Spark on Mesos / Executor Memory
From: Bharath Ravi Kumar
To: Tim Chen, user@spark.apache.org
Cc: user@mesos.apache.org
Date: Sat, 17 Oct 2015 11:17:38 +0530

Can someone respond if you're aware of the reason for such a memory footprint? It seems unintuitive and hard to reason about.

Thanks,
Bharath

On Thu, Oct 15, 2015 at 12:29 PM, Bharath Ravi Kumar wrote:

> Resending since user@mesos bounced earlier. My apologies.
>
> On Thu, Oct 15, 2015 at 12:19 PM, Bharath Ravi Kumar wrote:
>
>> (Reviving this thread since I ran into similar issues...)
>>
>> I'm running two Spark jobs (in Mesos fine-grained mode), each belonging to a different Mesos role, say low and high. The low:high Mesos weights are 1:10. As expected, the low priority job occupies cluster resources to the maximum extent when running alone. However, when the high priority job is submitted, it does not start and continues to await cluster resources (as seen in the logs). Since the jobs run in fine-grained mode and the low priority tasks begin to finish, the high priority job should ideally be able to start and gradually take over cluster resources as per the weights. However, I noticed that while the "low" job gives up CPU cores with each completing task (e.g. a reduction from 72 -> 12 with default parallelism set to 72), the memory resources are held on to (~500G out of 768G). The spark.executor.memory setting appears to directly determine the amount of memory that the job holds on to. In this case, it was set to 200G in the low priority job and 100G in the high priority job. The nature of these jobs is such that setting the numbers to smaller values (say 32g) resulted in job failures with OutOfMemoryError. It appears that the Spark framework is retaining memory (across tasks) proportional to spark.executor.memory for the duration of the job and not releasing it as tasks complete. This defeats the purpose of fine-grained mode execution, as the memory occupancy is preventing the high priority job from accepting the prioritized CPU offers and beginning execution. Can this be explained / documented better, please?
>>
>> Thanks,
>> Bharath
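For concreteness, the setup described above corresponds roughly to a configuration like the sketch below. The master URL, app name and values are illustrative rather than taken from the actual jobs, and spark.mesos.role assumes a Spark version (1.5+) that supports per-framework Mesos roles:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of the "low" job's submission (illustrative values only).
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181/mesos")  // placeholder Mesos master URL
      .setAppName("low-priority-job")
      .set("spark.mesos.coarse", "false")        // fine-grained mode
      .set("spark.mesos.role", "low")            // register under the "low" Mesos role
      .set("spark.executor.memory", "200g")      // reserved per executor for the job's lifetime
      .set("spark.default.parallelism", "72")
    val sc = new SparkContext(conf)

The high priority job would differ only in its role ("high") and spark.executor.memory ("100g"); the observation above is that those 200g/100g reservations persist even as individual fine-grained tasks complete and hand their CPU shares back.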
>> On Sat, Apr 11, 2015 at 10:59 PM, Tim Chen wrote:
>>
>>> (Adding spark user list)
>>>
>>> Hi Tom,
>>>
>>> If I understand correctly, you're saying that you're running into memory problems because the scheduler is allocating too many CPUs and not enough memory to accommodate them, right?
>>>
>>> In the case of fine-grained mode I don't think that's a problem, since we have a fixed amount of CPU and memory per task. However, in coarse-grained mode you can run into that problem if you're within the spark.cores.max limit and memory is a fixed number.
>>>
>>> I have a patch out to configure the max CPUs a coarse-grained executor should use, and it also allows multiple executors in coarse-grained mode. So you could, say, try to launch multiple executors of at most 4 cores each, each with spark.executor.memory (+ overhead, etc.), on a slave. (https://github.com/apache/spark/pull/4027)
>>>
>>> It also might be interesting to include a cores-to-memory multiplier, so that with a larger number of cores we scale the memory by some factor, but I'm not entirely sure that's intuitive to use or that people would know what to set it to, as that can likely change with different workloads.
>>>
>>> Tim
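To make the multiplier idea concrete, here is a sketch of what such a sizing rule could look like; none of these names exist in Spark today, and the numbers are purely illustrative:

    // Hypothetical cores-to-memory multiplier for sizing a coarse-grained executor
    // against a resource offer (not an existing Spark or Mesos setting).
    val baseOverheadMb  = 2048   // fixed per-executor overhead
    val memoryPerCoreMb = 4096   // the multiplier being discussed

    def executorMemoryMb(offeredCores: Int): Int =
      baseOverheadMb + memoryPerCoreMb * offeredCores

    println(executorMemoryMb(8))  // 34816 MB for an 8-core offer
    println(executorMemoryMb(2))  // 10240 MB for a 2-core offer

Whether a single factor like memoryPerCoreMb holds across workloads is exactly the doubt raised above.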
>>> On Sat, Apr 11, 2015 at 9:51 AM, Tom Arnfeld wrote:
>>>
>>>> We're running Spark 1.3.0 (with a couple of patches over the top for docker related bits).
>>>>
>>>> I don't think SPARK-4158 is related to what we're seeing; things do run fine on the cluster, given a ridiculously large executor memory configuration. As for SPARK-3535, although that looks useful, I think we're seeing something else.
>>>>
>>>> Put a different way, the amount of memory required at any given time by the Spark JVM process is directly proportional to the amount of CPU it has, because more CPU means more tasks and more tasks means more memory. Even if we're using coarse mode, the amount of executor memory should be proportional to the number of CPUs in the offer.
>>>>
>>>> On 11 April 2015 at 17:39, Brenden Matthews wrote:
>>>>
>>>>> I ran into some issues with it a while ago, and submitted a couple of PRs to fix it:
>>>>>
>>>>> https://github.com/apache/spark/pull/2401
>>>>> https://github.com/apache/spark/pull/3024
>>>>>
>>>>> Do these look relevant? What version of Spark are you running?
>>>>>
>>>>> On Sat, Apr 11, 2015 at 9:33 AM, Tom Arnfeld wrote:
>>>>>
>>>>>> Hey,
>>>>>>
>>>>>> Not sure whether it's best to ask this on the Spark mailing list or the Mesos one, so I'll try here first :-)
>>>>>>
>>>>>> I'm having a bit of trouble with out of memory errors in my Spark jobs... it seems fairly odd to me that memory resources can only be set at the executor level, and not also at the task level. For example, as far as I can tell there's only a *spark.executor.memory* config option.
>>>>>>
>>>>>> Surely the memory requirements of a single executor are quite dramatically influenced by the number of concurrent tasks running? Given a shared cluster, I have no idea what % of an individual slave my executor is going to get, so I basically have to set the executor memory to a value that's correct when the whole machine is in use...
>>>>>>
>>>>>> Has anyone else running Spark on Mesos come across this, or maybe someone could correct my understanding of the config options?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Tom.
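Tom's concern can be made concrete with a back-of-envelope calculation: with only a per-executor memory setting, the memory each concurrent task can count on shrinks as the executor picks up more cores from an offer. All numbers below are illustrative:

    // Memory effectively available per concurrent task when only
    // spark.executor.memory is configurable (illustrative numbers).
    val executorMemoryGb = 32.0   // spark.executor.memory
    val executorCores    = 16     // cores the executor happens to receive
    val taskCpus         = 1      // spark.task.cpus
    val concurrentTasks  = executorCores / taskCpus
    val memoryPerTaskGb  = executorMemoryGb / concurrentTasks

    println(memoryPerTaskGb)      // 2.0 GB; halves again if the executor gets 32 cores

Hence the suggestions in the thread to either bound the cores an executor takes or scale its memory with the size of the offer.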