hadoop-mapreduce-issues mailing list archives

From "Greg Roelofs (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-2041) TaskRunner logDir race condition leads to crash on job-acl.xml creation
Date Sat, 28 Aug 2010 05:18:54 GMT
TaskRunner logDir race condition leads to crash on job-acl.xml creation

                 Key: MAPREDUCE-2041
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
    Affects Versions: 0.22.0
         Environment: Linux/x86-64, 32-bit Java, NFS source tree
            Reporter: Greg Roelofs

TaskRunner's prepareLogFiles() logs a warning when mkdirs() fails but otherwise ignores the
failure.  It also never checks the return value of setPermissions().  Either call can fail (e.g.,
on NFS, where there appears to be a TOCTOU-style race, except with C = "creation"), in which case
the subsequent creation of job-acl.xml in writeJobACLs() also fails, killing the task:

2010-08-26 20:18:10,334 INFO  mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
Caused by: java.io.FileNotFoundException: /home/<username>/grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
    at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
    at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
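
For illustration only, here is a minimal sketch of the kind of checking the description above calls for, written against plain java.io.File rather than the real TaskRunner code. The class and method names (LogDirSketch, ensureLogDir, grantOwnerAccess) and the directory names in main() are hypothetical, and the three set*() calls are only a stand-in for whatever setPermissions() does in Hadoop. The idea is to fail at the point of the mkdirs()/permission problem instead of later in the job-acl.xml write.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper; NOT the actual TaskRunner implementation.
final class LogDirSketch {

    // Create dir and any missing parents, or throw.
    static void ensureLogDir(File dir) throws IOException {
        // mkdirs() returns false both on real failure and when the directory
        // already exists (e.g., another thread or host created it first), so
        // re-check isDirectory() before treating the result as an error.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Could not create log directory " + dir);
        }
    }

    // Stand-in for a permissions step: each setter reports failure through
    // its boolean return value, which is checked here rather than dropped.
    static void grantOwnerAccess(File dir) throws IOException {
        boolean ok = dir.setReadable(true, true)
                & dir.setWritable(true, true)
                & dir.setExecutable(true, true);
        if (!ok) {
            throw new IOException("Could not set permissions on " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed layout, loosely modeled on userlogs/<job>/<attempt>.
        File base = Files.createTempDirectory("userlogs").toFile();
        File attemptDir = new File(base, "job_0001/attempt_0001_m_000001_0");

        ensureLogDir(attemptDir);
        grantOwnerAccess(attemptDir);

        // Only now is it reasonable to create a file inside the directory.
        File acl = new File(attemptDir, "job-acl.xml");
        try (FileOutputStream out = new FileOutputStream(acl)) {
            out.write("<configuration/>".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("wrote " + acl);
    }
}
```

Checking isDirectory() after a false mkdirs() tolerates a concurrent creator, but it would not by itself address an NFS visibility delay; whether a retry or a different fix is appropriate there is left open.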

This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically.  The job-acl.xml failure
always seems to affect host2, and to do so more quickly than the intentional exception on host1,
so the wrong host is job-blacklisted and the test's assertion fails.

