To: core-user@hadoop.apache.org
From: "Billy Pearson"
Subject: reduce task failing after 24 hours waiting
Date: Wed, 25 Mar 2009 21:23:29 -0500

On one of my long-running jobs (about 50-60 hours), I am seeing that after 24 hours all active reduce tasks fail with the error message:

    java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Is there something in the config that I can change to stop this? Every time, within 1 minute of the 24-hour mark, they all fail at the same time. This wastes a lot of resources downloading the map outputs and merging them again.

Billy