hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4163) If a reducer failed at shuffling stage, the task should fail, not just logging an exception
Date Tue, 30 Sep 2008 01:21:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-4163:
----------------------------------

    Status: Open  (was: Patch Available)

* handleIfFSError(t) doesn't need to be called in contexts where mergeThrowable is set. Equivalent
code should be called after ReduceCopier::fetchOutputs returns false.
* Code handling FSError should be in its own catch block, not detected using instanceof in a method
called from a catch of Throwable. The retry loop is unnecessary, and the call to System.exit is
overly aggressive (i.e. handleIfFSError should not exist).
* Discarding map output cannot generate FSError and does not require handling.

handleIfFSError should be replaced with a catch of FSError before Throwable in MapOutputCopier::run that
calls umbilical.fsError (if that call throws, the exception can be logged and ignored). If reduceCopier.fetchOutputs
returns false, then reduceCopier.mergeThrowable should be the cause of the thrown exception
(it's OK if it's null). If mergeThrowable is an FSError, it would be reasonable to call umbilical.fsError
before the throw.
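
A rough sketch of the shape suggested above, not the actual ReduceTask code. The FSError, Umbilical, and Copy
types here are simplified stand-ins (the real FSError lives in org.apache.hadoop.fs and extends Error, and the
real umbilical takes a task attempt ID rather than a String). The names copyOnce, failIfFetchFailed, and log
are hypothetical. Logging and ignoring a failed umbilical.fsError call in the second method is an assumption
carried over from the first case.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ShuffleErrorSketch {
    // Stand-in for org.apache.hadoop.fs.FSError (an Error, not an Exception).
    static class FSError extends Error {
        FSError(Throwable cause) { super(cause); }
    }

    // Hypothetical, reduced stand-in for the task umbilical.
    interface Umbilical {
        void fsError(String taskId, String message) throws IOException;
    }

    // Hypothetical stand-in for a single map output copy.
    interface Copy {
        void run() throws Exception;
    }

    // Stand-in for the task log, so the sketch has no logging dependency.
    static final List<String> log = new ArrayList<>();

    // Shape of the catch chain for MapOutputCopier::run: FSError gets its
    // own catch block ahead of Throwable; no instanceof, retry, or exit.
    static void copyOnce(Copy copy, Umbilical umbilical, String taskId) {
        try {
            copy.run();
        } catch (FSError e) {
            log.add("FSError during copy: " + e.getMessage());
            try {
                umbilical.fsError(taskId, e.getMessage());
            } catch (IOException io) {
                // Reporting failed: log and ignore.
                log.add("Could not report FSError: " + io.getMessage());
            }
        } catch (Throwable t) {
            log.add("Map output copy failure: " + t);
        }
    }

    // Shape of the check after fetchOutputs returns false: mergeThrowable
    // becomes the cause of the thrown exception and may be null.
    static void failIfFetchFailed(boolean fetched, Throwable mergeThrowable,
                                  Umbilical umbilical, String taskId)
            throws IOException {
        if (fetched) {
            return;
        }
        if (mergeThrowable instanceof FSError) {
            try {
                umbilical.fsError(taskId, mergeThrowable.getMessage());
            } catch (IOException io) {
                // Assumption: as above, a failed report is logged and ignored.
                log.add("Could not report FSError: " + io.getMessage());
            }
        }
        throw new IOException(taskId + ": the reduce copier failed", mergeThrowable);
    }
}
```

With this shape the task fails through the thrown IOException instead of System.exit, and the tracker is still
told about the filesystem error when one was the cause.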

> If a reducer failed at shuffling stage, the task should fail, not just logging an exception
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4163
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4163
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Runping Qi
>            Assignee: Sharad Agarwal
>            Priority: Blocker
>             Fix For: 0.19.0
>
>         Attachments: 4163_v1.patch
>
>
> I saw a reducer stuck at the shuffling stage, with the following exception logged in
the log file:
> 2008-08-30 00:16:23,265 ERROR org.apache.hadoop.mapred.ReduceTask: Map output copy failure:
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
> 	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:199)
> 	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> 	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> 	at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:332)
> 	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
> 	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
> 	at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:185)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
> Caused by: java.io.IOException: No space left on device
> 	at java.io.FileOutputStream.writeBytes(Native Method)
> 	at java.io.FileOutputStream.write(FileOutputStream.java:260)
> 	at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:197)
> 	... 11 more
> 2008-08-30 00:16:23,320 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: task_200808291851_0001_r_000023_0The reduce copier failed
> 	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:329)
> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
> The task should have died.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

