From: Rohith Sharma K S
To: "user@hadoop.apache.org"
Subject: RE: Time out after 600 for YARN mapreduce application
Date: Wed, 11 Feb 2015 10:31:57 +0000

Looking at the attempt ID, this is a mapper task timing out in a MapReduce job. The configuration that controls this threshold, and that can be increased, is 'mapreduce.task.timeout'.

The task is timed out because there was no heartbeat from the mapper task (YarnChild) to the MRAppMaster for 10 minutes. Is your MR job a custom job? If so, are you doing any operation in the Mapper's cleanup()? It is possible that a cleanup() which takes longer than the configured timeout causes the task to be marked as timed out.
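As an illustration (a minimal sketch only, not Nutch code; the TimeoutFriendlyMapper class and its flushChunk() helper are made-up names), the driver below raises 'mapreduce.task.timeout' and the Mapper's cleanup() calls context.progress() between steps so the MRAppMaster keeps receiving heartbeats:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TimeoutFriendlyMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, new LongWritable(1)); // placeholder map logic
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // If cleanup() does heavy work (flushing buffers, closing remote
        // connections, ...), report progress between steps so the attempt
        // is not killed with "Timed out after 600 secs".
        for (int chunk = 0; chunk < 100; chunk++) {
            flushChunk(chunk);   // hypothetical long-running cleanup step
            context.progress();  // heartbeat to the MRAppMaster
        }
    }

    private void flushChunk(int chunk) {
        // stand-in for expensive cleanup work
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the per-task timeout from the 600000 ms default to 30 min.
        conf.setLong("mapreduce.task.timeout", 30 * 60 * 1000L);
        Job job = Job.getInstance(conf, "timeout-friendly job");
        job.setJarByClass(TimeoutFriendlyMapper.class);
        job.setMapperClass(TimeoutFriendlyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that 'mapreduce.task.timeout' (600000 ms by default, hence "Timed out after 600 secs") applies per task attempt; the yarn.*.liveness-monitor settings listed in the original mail below govern ApplicationMaster and NodeManager liveness, not individual task heartbeats.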
Thanks & Regards
Rohith Sharma K S

From: Alexandru Pacurar [mailto:Alexandru.Pacurar@PropertyShark.com]
Sent: 11 February 2015 15:34
To: user@hadoop.apache.org
Subject: Time out after 600 for YARN mapreduce application

Hello,

I keep encountering an error when running Nutch on Hadoop YARN:

AttemptID:attempt_1423062241884_9970_m_000009_0 Timed out after 600 secs

Some info on my setup: I'm running a 64-node cluster with Hadoop 2.4.1. Each node has 4 cores, 1 disk, and 24 GB of RAM, and the namenode/resourcemanager has the same specs, only with 8 cores.

I am pretty sure one of these parameters is tied to the threshold I'm hitting:

yarn.am.liveness-monitor.expiry-interval-ms
yarn.nm.liveness-monitor.expiry-interval-ms
yarn.resourcemanager.nm.liveness-monitor.interval-ms

but I would like to understand why.

The issue usually appears under heavier load, and most of the time the next attempts succeed. Also, if I restart the Hadoop cluster the error goes away for some time.

Thanks,
Alex
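A quick way to confirm which threshold is actually in effect on a client node (again only a sketch; the PrintTaskTimeout class name is made up) is to load the job client configuration, which pulls in mapred-site.xml, and print the value:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class PrintTaskTimeout {
    public static void main(String[] args) {
        // JobConf's static initializer registers mapred-default.xml and
        // mapred-site.xml as default resources, so the effective client
        // configuration is picked up.
        Configuration conf = new JobConf();
        long timeoutMs = conf.getLong("mapreduce.task.timeout", 600000L);
        System.out.println("mapreduce.task.timeout = " + timeoutMs + " ms");
    }
}

A value of 600000 ms here would match the "Timed out after 600 secs" diagnostic, pointing at the task heartbeat timeout rather than the YARN liveness-monitor settings.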