From: scwf
Date: Wed, 10 Dec 2014 16:59:02 +0800
Subject: Re: Question about container recovery
Mailing-List: user@hadoop.apache.org
Message-ID: <54880B56.3050805@huawei.com>
In-Reply-To: <5487DC93.9080105@huawei.com>

It seems there is a blacklist in YARN: when all containers on one NM are lost, is that NM added to the blacklist? If so, when is the NM removed from the blacklist?

On 2014/12/10 13:39, scwf wrote:
> Hi, all
>
> Here is my question: is there a mechanism by which, when a container exits abnormally, YARN prefers to dispatch the replacement container to another NM?
>
> We have a cluster with 3 NMs (each NM has 135g of memory) and 1 RM, and we are running a job which starts 13 containers (= 1 AM + 12 executor containers).
>
> Each NM has 4 executor containers, and the memory configured for each executor container is 30g. There is an interesting test: when we killed
> 4 containers on NM1, only 2 containers restarted on NM1; the other 2 containers were reserved on NM2 and NM3.
>
> Any idea?
>
> Fei.
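
For reference, here is a rough sketch of the memory arithmetic for the setup quoted above. It uses only the figures from the original message (3 NMs, 135g each, 30g per executor). It ignores the AM container and any per-container overhead, which is an assumption, since neither size is given. The names are illustrative only and are not YARN settings.

```python
# Figures taken from the quoted message; names are hypothetical, not YARN config keys.
NM_MEMORY_GB = 135        # memory available on each NM
EXECUTOR_MEMORY_GB = 30   # memory configured per executor container
NUM_NMS = 3               # number of NMs in the cluster


def executors_per_nm(nm_gb: int, executor_gb: int) -> int:
    """How many whole executor containers fit on one NM (AM and overhead ignored)."""
    return nm_gb // executor_gb


per_nm = executors_per_nm(NM_MEMORY_GB, EXECUTOR_MEMORY_GB)
total_executors = per_nm * NUM_NMS
# Memory left on an NM that is already running `per_nm` executors.
headroom_gb = NM_MEMORY_GB - per_nm * EXECUTOR_MEMORY_GB

print(f"executors per NM: {per_nm}")
print(f"executors in cluster: {total_executors}")
print(f"free memory on a full NM: {headroom_gb}g")
```

Under these assumptions, an NM that already runs 4 executors has less free memory than one 30g executor needs, which may be relevant to why the 2 containers on NM2 and NM3 showed up as reserved rather than running.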