hadoop-mapreduce-issues mailing list archives

From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-2041) TaskRunner logDir race condition leads to crash on job-acl.xml creation
Date Sat, 28 Aug 2010 05:27:53 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs updated MAPREDUCE-2041:

    Attachment: MR-2041.v1.trunk-hadoop-mapreduce.patch

Patch that improves TaskRunner's error-checking.  This makes the failure mechanism more obvious
but does not address the nondeterministic behavior of TestTrackerBlacklistAcrossJobs.  (A
minor tweak - removing the "throw ie;" line - _does_ fix the test.  However, I'm assuming
we don't want to ignore the failure to create job-acl.xml in the general case.)
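
For readers without the patch at hand, the kind of error-checking described above might look roughly like the sketch below. It is not the patch itself: the class LogDirSetup and the helper ensureLogDir() are hypothetical names, and plain java.io.File setters stand in for TaskRunner's actual permission-setting call. The point is only that a false return from mkdirs() or from the permission step becomes an immediate IOException rather than a warning, so the failure surfaces where it happens instead of later at job-acl.xml creation.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class LogDirSetup {

    /**
     * Hypothetical helper: create logDir (and any missing parents) and set
     * basic permissions, throwing instead of merely warning on failure.
     */
    static void ensureLogDir(File logDir) throws IOException {
        // mkdirs() returns false both on a real failure and when the
        // directory already exists, so re-check isDirectory() before
        // treating the result as an error.
        if (!logDir.mkdirs() && !logDir.isDirectory()) {
            throw new IOException("mkdirs failed for " + logDir);
        }
        // Stand-in for the real permission-setting step: the return values
        // are checked rather than dropped.
        if (!logDir.setReadable(true, false) || !logDir.setExecutable(true, false)) {
            throw new IOException("could not set permissions on " + logDir);
        }
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("logdir-demo").toFile();
        File logDir = new File(base, "userlogs/job_0001/attempt_0");
        ensureLogDir(logDir);
        System.out.println("created: " + logDir.isDirectory());
    }
}
```

Note that this only makes the failure explicit. It does not remove the underlying race, which matches the caveat above about TestTrackerBlacklistAcrossJobs.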

> TaskRunner logDir race condition leads to crash on job-acl.xml creation
> -----------------------------------------------------------------------
>                 Key: MAPREDUCE-2041
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.22.0
>         Environment: Linux/x86-64, 32-bit Java, NFS source tree
>            Reporter: Greg Roelofs
>         Attachments: MR-2041.v1.trunk-hadoop-mapreduce.patch
> TaskRunner's prepareLogFiles() warns on mkdirs() failures but otherwise ignores them.  It
> does not check the return value of setPermissions() at all.  Either one can fail (e.g., on
> NFS, where there appears to be a TOCTOU-style race, except with C = "creation"), in which
> case the subsequent creation of job-acl.xml in writeJobACLs() will also fail, killing the task:
> {noformat}
> 2010-08-26 20:18:10,334 INFO  mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/
> java.lang.Throwable: Child Error
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
> Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
>     at java.io.FileOutputStream.open(Native Method)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
>     at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
>     at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
>     at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
> {noformat}
> This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically; the job-acl.xml
> failure always seems to affect host2 - and to do so more quickly than the intentional exception
> on host1 - which triggers an assertion failure due to the wrong host being job-blacklisted.
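
The symptom in the stack trace can be reproduced in isolation. The following is a minimal sketch, not Hadoop code: MissingParentDemo and failsWithoutParent() are made-up names, and a temporary directory stands in for the userlogs tree. It shows that opening a FileOutputStream on a path whose parent directory does not exist raises FileNotFoundException, which is what writeJobACLs() runs into when the earlier mkdirs() has silently failed.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class MissingParentDemo {

    /** Returns true if opening the file for writing fails with FileNotFoundException. */
    static boolean failsWithoutParent(File file) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            return false; // parent directory existed, so the open succeeded
        } catch (FileNotFoundException e) {
            return true;  // same exception type as in the trace above
        }
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("acl-demo").toFile();
        // Hypothetical attempt directory that was never created.
        File aclFile = new File(new File(base, "attempt_0"), "job-acl.xml");

        System.out.println("before mkdirs: fails = " + failsWithoutParent(aclFile));
        aclFile.getParentFile().mkdirs();
        System.out.println("after mkdirs:  fails = " + failsWithoutParent(aclFile));
    }
}
```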

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
