From: Ravi Prakash
Date: Thu, 3 Nov 2016 15:22:10 -0700
Subject: Re: why the default value of 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in yarn-default.xml is so high?
To: Tanvir Rahman
Cc: user@hadoop.apache.org

Hi Tanvir!

Although an application may request that node, a container won't be scheduled on it until the nodemanager sends a heartbeat. If the application hasn't specified a preference for that node, then whichever node heartbeats next will be used to launch a container.
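For reference, the liveness settings discussed in this thread would look roughly like the following in yarn-site.xml. The values shown are just the yarn-default.xml defaults mentioned below, so treat this as an illustrative sketch rather than a tuning recommendation:

  <!-- How often the RM checks that containers are still alive (default: 10 minutes). -->
  <property>
    <name>yarn.resourcemanager.container.liveness-monitor.interval-ms</name>
    <value>600000</value>
  </property>

  <!-- How often the RM checks that node managers are still alive (default: 1 second). -->
  <property>
    <name>yarn.resourcemanager.nm.liveness-monitor.interval-ms</name>
    <value>1000</value>
  </property>

  <!-- How long the RM waits for a missing NM heartbeat before declaring the node dead (default: 10 minutes). -->
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>

Lowering yarn.nm.liveness-monitor.expiry-interval-ms makes the RM reschedule work from a failed NM sooner, at the cost of possibly declaring nodes dead during a transient network glitch.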
HTH
Ravi
On Thu, Nov 3, 2016 at 12:12 PM, Tanvir Rahman <tanvir9982000@gmail.com> wrote:

> Thank you Ravi for your reply.
> I found one parameter, 'yarn.resourcemanager.nm.liveness-monitor.interval-ms'
> (default value = 1000 ms), in yarn-default.xml (v2.4.1), which determines how
> often to check that node managers are still alive. So the RM checks the NM
> heartbeat every second, but it takes 10 minutes to decide whether the NM is
> dead or not (yarn.nm.liveness-monitor.expiry-interval-ms: how long to wait
> until a node manager is considered dead; default value = 600000 ms).
>
> What happens if the RM finds that an NM's heartbeat is missing but the
> expiry time (yarn.nm.liveness-monitor.expiry-interval-ms) has not passed yet?
> Will a new application still make a container request to that NM via the RM?
>
> Thanks
> Tanvir
> On Wed, Nov 2, 2016 at 5:41 PM, Ravi Prakash <ravihadoop@gmail.com> wrote:
>
>> Hi Tanvir!
>>
>> It's hard to have one configuration that works for all cluster scenarios.
>> I suspect that value was chosen to roughly mirror the time it takes HDFS to
>> realize a datanode is dead (which is also 10 minutes, from what I remember).
>> The RM also has to reschedule the work when that timeout expires, and there
>> may be network glitches that last that long. Also, the NMs are pretty stable
>> by themselves; failing NMs have not been too common in my experience.
>>
>> HTH
>> Ravi
>> On Wed, Nov 2, 2016 at 10:44 AM, Tanvir Rahman <tanvir9982000@gmail.com> wrote:
>>
>>> Hello,
>>> Can anyone please tell me why the default value of
>>> 'yarn.resourcemanager.container.liveness-monitor.interval-ms' in
>>> yarn-default.xml is so high? This parameter determines "How often to check
>>> that containers are still alive". The default value is 600000 ms, i.e. 10
>>> minutes. So if a node manager fails, the resource manager detects the dead
>>> container only after 10 minutes.
>>>
>>> I am running a wordcount job on my university cluster. In the middle of a
>>> run, I stopped the node manager on one node (the datanode was still running)
>>> and found that the completion time increased by about 10 minutes because of
>>> the node manager failure.
>>>
>>> Thanks in advance
>>> Tanvir