From: Ravi Prakash
Date: Thu, 3 Nov 2016 15:22:10 -0700
Subject: Re: why the default value of 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in yarn-default.xml is so high?
To: Tanvir Rahman
Cc: user@hadoop.apache.org

Hi Tanvir!

Although an application may request that node, a container won't be scheduled on it until the nodemanager sends a heartbeat. If the application hasn't specified a preference for that node, then whichever node heartbeats next will be used to launch a container.
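For reference, the liveness settings discussed in this thread would look roughly like the following in yarn-site.xml. The values shown are just the yarn-default.xml defaults mentioned below, so treat this as an illustrative sketch rather than a tuning recommendation:

  <!-- How often the RM checks that containers are still alive (default: 10 minutes). -->
  <property>
    <name>yarn.resourcemanager.container.liveness-monitor.interval-ms</name>
    <value>600000</value>
  </property>

  <!-- How often the RM checks that node managers are still alive (default: 1 second). -->
  <property>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <value>1000</value>
  </property>

  <!-- How long the RM waits for a missing NM heartbeat before declaring the node dead (default: 10 minutes). -->
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>

Lowering yarn.nm.liveness-monitor.expiry-interval-ms makes the RM reschedule work from a failed NM sooner, at the cost of possibly declaring nodes dead during a transient network glitch.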
HTH
Ravi
On Thu, Nov 3, 2016 at 12:12 PM, Tanvir Rahman <tanvir9982000@gmail.com> wrote:

> Thank you Ravi for your reply.
> I found one parameter, 'yarn.resourcemanager.nm.liveness-monitor.interval-ms'
> (default value = 1000 ms), in yarn-default.xml (v2.4.1), which determines how
> often to check that node managers are still alive. So the RM checks the NM
> heartbeat every second, but it takes 10 minutes to decide whether the NM is
> dead or not (yarn.nm.liveness-monitor.expiry-interval-ms: how long to wait
> until a node manager is considered dead; default value = 600000 ms).
>
> What happens if the RM finds that an NM's heartbeat is missing but the
> expiry time (yarn.nm.liveness-monitor.expiry-interval-ms) has not passed yet?
> Will a new application still make a container request to that NM via the RM?
>
> Thanks
> Tanvir
> On Wed, Nov 2, 2016 at 5:41 PM, Ravi Prakash <ravihadoop@gmail.com> wrote:
>
>> Hi Tanvir!
>>
>> It's hard to have one configuration that works for all cluster scenarios.
>> I suspect that value was chosen to roughly mirror the time it takes HDFS to
>> realize a datanode is dead (which is also 10 minutes, from what I remember).
>> The RM also has to reschedule the work when that timeout expires, and there
>> may be network glitches that last that long. Also, the NMs are pretty stable
>> by themselves; failing NMs have not been too common in my experience.
>>
>> HTH
>> Ravi
>> On Wed, Nov 2, 2016 at 10:44 AM, Tanvir Rahman <tanvir9982000@gmail.com> wrote:
>>
>>> Hello,
>>> Can anyone please tell me why the default value of
>>> 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in
>>> yarn-default.xml is so high? This parameter determines "How often to check
>>> that containers are still alive". The default value is 600000 ms, i.e. 10
>>> minutes. So if a node manager fails, the resource manager detects the dead
>>> container only after 10 minutes.
>>>
>>> I am running a wordcount job on my university cluster. In the middle of a
>>> run, I stopped the node manager on one node (the datanode was still running)
>>> and found that the completion time increased by about 10 minutes because of
>>> the node manager failure.
>>>
>>> Thanks in advance
>>> Tanvir