hadoop-common-user mailing list archives

From Tanvir Rahman <tanvir9982...@gmail.com>
Subject Re: why the default value of 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in yarn-default.xml is so high?
Date Thu, 03 Nov 2016 19:12:18 GMT
Thank you, Ravi, for your reply.
I found a parameter, 'yarn.resourcemanager.nm.liveness-monitor.interval-ms'
(default value = 1000 ms), in yarn-default.xml (v2.4.1) that determines how
often to check that node managers are still alive. So the RM checks the
heartbeat of each NM every second, but it takes 10 minutes to decide that an
NM is dead (yarn.nm.liveness-monitor.expiry-interval-ms: how long to wait
until a node manager is considered dead; default value = 600000 ms).
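
To make the interaction between the two settings concrete, here is a rough
sketch of the arithmetic. It is not YARN code: the function name, the
assumption that checks fire at exact multiples of the check interval, and the
strict "older than expiry" comparison are all mine, for illustration only.

```python
def detection_delay_ms(last_heartbeat_ms, expiry_ms=600_000, check_interval_ms=1_000):
    """Time from the last heartbeat until a periodic check first sees the node as expired.

    Assumptions (hypothetical model, not taken from the YARN source):
    - the monitor runs at t = 0, check_interval_ms, 2 * check_interval_ms, ...
    - a node counts as dead once now - last_heartbeat_ms > expiry_ms
    """
    deadline = last_heartbeat_ms + expiry_ms
    # First check tick that falls strictly after the deadline.
    first_check = (deadline // check_interval_ms + 1) * check_interval_ms
    return first_check - last_heartbeat_ms


if __name__ == "__main__":
    # With the defaults quoted above, the expiry interval dominates the delay.
    print(detection_delay_ms(0))
    # A shorter (hypothetical) expiry of 30 s would be noticed much sooner.
    print(detection_delay_ms(0, expiry_ms=30_000))
```

Under these assumptions the one-second check interval adds at most one second;
the roughly 10 minute delay comes almost entirely from the expiry interval.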

What happens if the RM finds that an NM's heartbeat is missing but 10 minutes
have not yet passed (that is, yarn.nm.liveness-monitor.expiry-interval-ms has
not expired)? Will a new application still request containers on that NM via
the RM?


On Wed, Nov 2, 2016 at 5:41 PM, Ravi Prakash <ravihadoop@gmail.com> wrote:

> Hi Tanvir!
> It's hard to have one configuration that works for all cluster scenarios.
> I suspect that value was chosen somewhat as a mirror of the time it takes
> HDFS to realize a datanode is dead (which is also 10 minutes, from what I
> remember). The RM also has to reschedule the work when that timeout
> expires. Also, there may be network glitches that could last that
> long. Also, the NMs are pretty stable by themselves. Failing NMs have
> not been too common in my experience.
> Ravi
> On Wed, Nov 2, 2016 at 10:44 AM, Tanvir Rahman <tanvir9982000@gmail.com>
> wrote:
>> Hello,
>> Can anyone please tell me why the default value of
>> 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in
>> yarn-default.xml
>> <https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>
>> is so high? This parameter determines "How often to check that containers
>> are still alive". The default value is 600000 ms, or 10 minutes. So if a
>> node manager fails, the resource manager detects the dead container after
>> 10 minutes.
>> I am running a wordcount job on my university cluster. In the middle of
>> the run, I stopped the node manager on one node (the data node was still
>> running) and found that the completion time increased by about 10 minutes
>> because of the node manager failure.
>> Thanks in advance
>> Tanvir
