Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Reply-To: core-dev@hadoop.apache.org
Message-ID: <1245618379.1222737704487.JavaMail.jira@brutus>
Date: Mon, 29 Sep 2008 18:21:44 -0700 (PDT)
From: "Chris Douglas (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception
In-Reply-To: <198344745.1221170624300.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4163:
----------------------------------
    Status: Open  (was: Patch Available)

* handleIfFSError(t) doesn't need to be called in contexts where mergeThrowable is set. Equivalent code should be called after ReduceCopier::fetchOutputs returns false.
* Code handling FSError should be in a catch block, not handled using instanceof in a method call from a catch of Throwable. The retry loop is unnecessary. The call to System.exit is overly aggressive. (i.e. handleIfFSError should not exist)
* Discarding map output cannot generate FSError and does not require handling.

This should be replaced with a catch of FSError before Throwable in MapOutputCopier::run that calls umbilical.fsError (if it throws, the exception can be logged and ignored). If reduceCopier.fetchOutputs returns false, then reduceCopier.mergeThrowable should be the cause of the thrown exception (it's OK if it's null). If mergeThrowable is FSError, it would be reasonable to call umbilical.fsError before the throw.

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch
>
>
> I saw a reducer stuck at the shuffling stage, with the following exception logged in the log file:
> 2008-08-30 00:16:23,265 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure: org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
> 	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
> 	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> 	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> 	at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
> 	at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:185)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
> Caused by: java.io.IOException: No space left on device
> 	at java.io.FileOutputStream.writeBytes(Native Method)
> 	at java.io.FileOutputStream.write(FileOutputStream.java:260)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
> 	... 11 more
> 2008-08-30 00:16:23,320 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: task_200808291851_0001_r_000023_0The reduce copier failed
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
> The task should have died.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
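
As a rough illustration of the restructuring suggested in the review comments on this issue, here is a minimal Java sketch. It is not the patch and not the actual ReduceTask code. FSError, Umbilical, CopyWork and ReduceCopier are simplified stand-ins so that the example compiles on its own; runCopy, reportFsError, shuffle, the fatalCopyError flag and the task id are hypothetical names. The sketch also assumes that fetchOutputs returns false once a copier has hit an FSError, since the comments do not say how that is signaled.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ShuffleFailureSketch {

    /** Simplified stand-in for org.apache.hadoop.fs.FSError. */
    static class FSError extends Error {
        FSError(Throwable cause) { super(cause); }
    }

    /** Hypothetical stand-in for the task umbilical; only fsError is modeled. */
    interface Umbilical {
        void fsError(String taskId, String message) throws IOException;
    }

    /** Hypothetical unit of copy work (one map output fetch). */
    interface CopyWork {
        void copy() throws IOException;
    }

    /** Minimal model of ReduceCopier; not the real class. */
    static class ReduceCopier {
        Throwable mergeThrowable;        // may legitimately stay null
        private boolean fatalCopyError;  // assumption: how a copier signals a fatal failure
        private final String taskId;
        private final Umbilical umbilical;

        ReduceCopier(String taskId, Umbilical umbilical) {
            this.taskId = taskId;
            this.umbilical = umbilical;
        }

        /** Models the body of MapOutputCopier::run for a single fetch. */
        void runCopy(CopyWork work) {
            try {
                work.copy();
            } catch (FSError e) {
                // FSError is caught before Throwable: no instanceof, no retry, no System.exit.
                fatalCopyError = true;
                reportFsError(e);
            } catch (Throwable t) {
                // Other failures are just logged in this sketch.
                System.err.println("Map output copy failure: " + t);
            }
        }

        /** Calls umbilical.fsError; if that call throws, log and ignore. */
        void reportFsError(Throwable e) {
            try {
                umbilical.fsError(taskId, e.getMessage());
            } catch (IOException io) {
                System.err.println("Could not report FSError: " + io);
            }
        }

        boolean fetchOutputs(List<CopyWork> fetches) {
            for (CopyWork w : fetches) {
                runCopy(w);
                if (fatalCopyError || mergeThrowable != null) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Caller side: a failed shuffle fails the task instead of only logging. */
    static void shuffle(ReduceCopier copier, List<CopyWork> fetches) throws IOException {
        if (!copier.fetchOutputs(fetches)) {
            if (copier.mergeThrowable instanceof FSError) {
                copier.reportFsError(copier.mergeThrowable);
            }
            // mergeThrowable becomes the cause; a null cause is acceptable.
            throw new IOException(copier.taskId + ": the reduce copier failed",
                                  copier.mergeThrowable);
        }
    }

    public static void main(String[] args) {
        List<String> reports = new ArrayList<>();
        ReduceCopier copier =
            new ReduceCopier("task_example_r_000000_0", (id, msg) -> reports.add(msg));
        List<CopyWork> fetches = new ArrayList<>();
        fetches.add(() -> { throw new FSError(new IOException("No space left on device")); });
        try {
            shuffle(copier, fetches);
        } catch (IOException e) {
            System.out.println("task failed: " + e.getMessage() + ", reports=" + reports);
        }
    }
}
```

The point of the design is that the failure leaves fetchOutputs as a return value and is turned into an exception by the caller, so the task dies rather than hanging in the shuffle, and the FSError is reported through the umbilical without terminating the JVM from inside the copier.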