hadoop-mapreduce-issues mailing list archives

From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-2041) TaskRunner logDir race condition leads to crash on job-acl.xml creation
Date Tue, 31 Aug 2010 03:26:54 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904508#action_12904508 ]

Greg Roelofs commented on MAPREDUCE-2041:
-----------------------------------------

Just FYI, I ran TestTrackerBlacklistAcrossJobs 16 times without any failures on a local (ext3
on md) filesystem on the same node as above.  The nondeterminism definitely seems to be associated
either with non-guaranteed filesystem semantics (i.e., bad assumptions in the test and/or
MR code) or with network timing and asynchronous function calls (which I guess also devolves
to bad assumptions in the test and/or MR code).  Given that NFS is normally relevant only
for development and personal runs of "ant test" (and frequently not even there), this doesn't
seem like a critical problem.

If anyone ever wants to track down other NFS-related failures, however, this might be a useful
test case to get started.

> TaskRunner logDir race condition leads to crash on job-acl.xml creation
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2041
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.22.0
>         Environment: Linux/x86-64, 32-bit Java, NFS source tree
>            Reporter: Greg Roelofs
>         Attachments: MR-2041.v1.trunk-hadoop-mapreduce.patch
>
>
> TaskRunner's prepareLogFiles() warns on mkdirs() failures but ignores them.  It also
> fails even to check the return value of setPermissions().  Either one can fail (e.g., on NFS,
> where there appears to be a TOCTOU-style race, except with C = "creation"), in which case
> the subsequent creation of job-acl.xml in writeJobACLs() will also fail, killing the task:
> {noformat}
> 2010-08-26 20:18:10,334 INFO  mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
> Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
>     at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
>     at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
> {noformat}
> This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically; the job-acl.xml
> failure always seems to affect host2 - and to do so more quickly than the intentional exception
> on host1 - which triggers an assertion failure due to the wrong host being job-blacklisted.
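
For illustration only, the failure mode in the description above can be sketched in plain Java. This is not the attached patch and does not use the actual TaskRunner or Hadoop APIs; the class LogDirSketch and its methods ensureLogDir() and writeAclFile() are hypothetical names, and only java.io calls are used. The idea is to check the results of mkdirs() and the permission setters and to fail at that point, rather than letting the later creation of job-acl.xml fail with a FileNotFoundException.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch; not the MAPREDUCE-2041 patch and not Hadoop code.
final class LogDirSketch {

    /** Creates dir (and any missing parents) or throws an IOException. */
    static void ensureLogDir(File dir) throws IOException {
        // mkdirs() returns false both on a real failure and when the
        // directory already exists (e.g., another thread created it first),
        // so re-check with isDirectory() before giving up.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Could not create log directory " + dir);
        }
        // Grant the owner read/write/execute. Each setter reports failure
        // through its boolean return value, which is checked here.
        boolean ok = dir.setReadable(true, true)
                & dir.setWritable(true, true)
                & dir.setExecutable(true, true);
        if (!ok) {
            throw new IOException("Could not set permissions on " + dir);
        }
    }

    /** Writes a job-acl.xml file into dir after verifying that dir exists. */
    static File writeAclFile(File dir, byte[] contents) throws IOException {
        ensureLogDir(dir);
        File acl = new File(dir, "job-acl.xml");
        try (FileOutputStream out = new FileOutputStream(acl)) {
            out.write(contents);
        }
        return acl;
    }
}
```

With a check of this kind, a directory-creation problem would surface as an explicit error naming the log directory. Whether that is sufficient on NFS, where the directory may not yet be visible to the process that needs it, is a separate question that this sketch does not address.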

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

