hadoop-common-dev mailing list archives

From "Tahir Hashmi (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-1152) Reduce task hang failing in MapOutputCopier.copyOutput
Date Fri, 20 Apr 2007 09:25:15 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tahir Hashmi updated HADOOP-1152:
---------------------------------

    Attachment: 1152.patch

Looked at this with Devaraj yesterday. Our theory about why this fails is that in
ReduceTaskRunner.MapOutputCopier.copyOutput() there's a call to rename a temporary file to
the actual .out file. After the rename, another call is made to get the length of the actual
file (which is obviously the same as that of the temporary file). If a MergeThread flushes
the .out file to disk and deletes it between these two calls, the call to getLength() will fail.

Sameer's suggestion was to simply invoke getLength() on the temporary file before the rename
and discard the value in case the rename fails. After the file is renamed, it should be assumed
that the local code no longer owns it. 1152.patch has these changes incorporated.
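
For illustration only, here is a minimal sketch of the ordering described above. It is not
the contents of 1152.patch: it uses java.nio.file instead of the Hadoop FileSystem API, and
the class name, method name and file names are all hypothetical. The point it tries to show
is that the length is read while the temporary file is still ours, and that the renamed file
is never touched again.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the copier logic; not Hadoop code.
final class CopyOutputSketch {

    /**
     * Reads the length of the temporary file first, then renames it.
     * If the move throws, the length is simply discarded with the exception.
     */
    static long renameAndReportLength(Path tmp, Path dest) throws IOException {
        // Measure while we still own the file.
        long length = Files.size(tmp);

        // Without REPLACE_EXISTING this throws FileAlreadyExistsException
        // when dest is already present.
        Files.move(tmp, dest);

        // From here on another thread may merge and delete dest,
        // so we must not ask the file system about it again.
        return length;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("copy-output-sketch");
        Path tmp = dir.resolve("map_0.out.tmp");
        Path dest = dir.resolve("map_0.out");
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5});

        long length = renameAndReportLength(tmp, dest);

        // Simulate a merge thread consuming the file right after the rename.
        Files.delete(dest);

        // The length is still available even though dest is gone.
        System.out.println("copied " + length + " bytes");
    }
}
```

With the original ordering (rename first, then ask for the length of the .out file), the
simulated delete in main() would have to win only once for the length lookup to fail.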

> Reduce task hang failing in MapOutputCopier.copyOutput
> ------------------------------------------------------
>
>                 Key: HADOOP-1152
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1152
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Koji Noguchi
>         Assigned To: Tahir Hashmi
>         Attachments: 1152.patch, 1152.workaround.patch
>
>
> We had a couple of reduce tasks hang, repeating the output below.
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: java.io.IOException: Path /hadoop/mapred/local/task_0026_r_000307_0/map_7854.out already exists
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.rename(InMemoryFileSystem.java:246)
>   at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:471)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:336)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)
> 2007-03-22 23:57:16,296 WARN org.apache.hadoop.mapred.TaskRunner: task_0026_r_000307_0 adding host ______  to penalty box, next contact in 192 seconds
> ===============================
> Before the above output, there was
> 2007-03-22 18:15:24,274 ERROR org.apache.hadoop.mapred.TaskRunner: Map output copy failure: java.lang.NullPointerException
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$FileAttributes.access$300(InMemoryFileSystem.java:416)
>   at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem.getLength(InMemoryFileSystem.java:286)
>   at org.apache.hadoop.fs.FilterFileSystem.getLength(FilterFileSystem.java:178)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:340)
>   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:274)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

