Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Message-ID: <54880B56.3050805@huawei.com>
Date: Wed, 10 Dec 2014 16:59:02 +0800
From: scwf <wangfei1@huawei.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1;
 rv:17.0) Gecko/20130509 Thunderbird/17.0.6
MIME-Version: 1.0
To: <user@hadoop.apache.org>
Subject: Re: Question about container recovery
References: <5487DC93.9080105@huawei.com>
In-Reply-To: <5487DC93.9080105@huawei.com>
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit

It seems there is a blacklist in YARN: when all containers on one NM are lost, is that NM added to the blacklist? If so, when is the NM removed from the blacklist?

On 2014/12/10 13:39, scwf wrote:
> Hi, all
>    Here is my question: is there a mechanism by which, when a container exits abnormally, YARN prefers to dispatch the replacement container to another NM?
>
> We have a cluster with 3 NMs (each with 135g of memory) and 1 RM, and we are running a job that starts 13 containers (1 AM + 12 executor containers).
>
> Each NM has 4 executor containers, and the memory configured for each executor container is 30g. In one interesting test, when we killed 4 containers on NM1, only 2 containers were restarted on NM1; the other 2 containers were reserved on NM2 and NM3.
>
>    Any idea?
>
> Fei.
>
>
>
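
As a rough check of the memory arithmetic in the test described above, here is a minimal Python sketch. It assumes memory is the only constraint, and it ignores the AM container (its size is not given) as well as any per-container overhead or scheduler rounding. The function names are illustrative only and are not YARN APIs.

```python
# Figures taken from the quoted message; helper names are hypothetical.
NM_COUNT = 3            # number of NodeManagers
NM_MEM_GB = 135         # memory per NodeManager
EXECUTOR_MEM_GB = 30    # memory configured per executor container


def executors_per_nm(nm_mem_gb: int, executor_mem_gb: int) -> int:
    """Executor containers that fit on one NM, counting memory only."""
    return nm_mem_gb // executor_mem_gb


def headroom_gb(nm_mem_gb: int, executor_mem_gb: int, running: int) -> int:
    """Memory left on one NM with `running` executor containers on it."""
    return nm_mem_gb - running * executor_mem_gb


if __name__ == "__main__":
    per_nm = executors_per_nm(NM_MEM_GB, EXECUTOR_MEM_GB)
    print("executors per NM:", per_nm)
    print("executors cluster-wide:", per_nm * NM_COUNT)
    # NM2 and NM3 still run 4 executors each after the kill on NM1.
    print("headroom on a full NM (GB):",
          headroom_gb(NM_MEM_GB, EXECUTOR_MEM_GB, 4))
    # NM1 after only 2 of the 4 killed executors came back.
    print("headroom on NM1 with 2 executors (GB):",
          headroom_gb(NM_MEM_GB, EXECUTOR_MEM_GB, 2))
```

By this arithmetic NM1 still has room for two more 30g containers after the test, while NM2 and NM3 do not have a free 30g slot, so memory alone does not seem to explain why the other 2 containers were reserved on NM2 and NM3 instead of restarting on NM1.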