Message-ID: <3341609.85941283225214196.JavaMail.jira@thor>
Date: Mon, 30 Aug 2010 23:26:54 -0400 (EDT)
From: "Greg Roelofs (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Commented: (MAPREDUCE-2041) TaskRunner logDir race condition leads to crash on job-acl.xml creation
In-Reply-To: <30021165.47671282972734171.JavaMail.jira@thor>

[
https://issues.apache.org/jira/browse/MAPREDUCE-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904508#action_12904508 ]

Greg Roelofs commented on MAPREDUCE-2041:
-----------------------------------------

Just FYI, I ran TestTrackerBlacklistAcrossJobs 16 times without any failures on a local (ext3 on md) filesystem on the same node as above. The nondeterminism definitely seems to be associated either with non-guaranteed filesystem semantics (i.e., bad assumptions in the test and/or MR code) or with network timing and asynchronous function calls (which I guess also devolves to bad assumptions in the test and/or MR code).

Given that NFS is normally relevant only for development and personal runs of "ant test" (and frequently not even there), this doesn't seem like a critical problem. If anyone ever wants to track down other NFS-related failures, however, this might be a useful test case to get started with.

> TaskRunner logDir race condition leads to crash on job-acl.xml creation
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2041
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2041
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 0.22.0
>         Environment: Linux/x86-64, 32-bit Java, NFS source tree
>            Reporter: Greg Roelofs
>         Attachments: MR-2041.v1.trunk-hadoop-mapreduce.patch
>
>
> TaskRunner's prepareLogFiles() warns on mkdirs() failures but ignores them. It also fails even to check the return value of setPermissions().
> Either one can fail (e.g., on NFS, where there appears to be a TOCTOU-style race, except with C = "creation"), in which case the subsequent creation of job-acl.xml in writeJobACLs() will also fail, killing the task:
> {noformat}
> 2010-08-26 20:18:10,334 INFO  mapred.TaskInProgress (TaskInProgress.java:updateStatus(591)) - Error from attempt_20100826201758813_0001_m_000001_0 on tracker_host2.rack.com:rh45-64/127.0.0.1:35112: java.lang.Throwable: Child Error
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
> Caused by: java.io.FileNotFoundException: /home//grid/trunk/hadoop-mapreduce/build/test/logs/userlogs/job_20100826201758813_0001/attempt_20100826201758813_0001_m_000001_0/job-acl.xml (No such file or directory)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
>         at org.apache.hadoop.mapred.TaskRunner.writeJobACLs(TaskRunner.java:307)
>         at org.apache.hadoop.mapred.TaskRunner.prepareLogFiles(TaskRunner.java:290)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:199)
> {noformat}
> This in turn causes TestTrackerBlacklistAcrossJobs to fail sporadically; the job-acl.xml failure always seems to affect host2 - and to do so more quickly than the intentional exception on host1 - which triggers an assertion failure due to the wrong host being job-blacklisted.
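
For illustration only, here is a minimal sketch of what checking the two calls named in the issue description (mkdirs() and the permission setter) could look like. It uses plain java.io.File rather than Hadoop's TaskRunner code, it is not the attached patch, and the names LogDirSketch, ensureDirectory, grantOwnerAccess, and prepareLogDir are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, not part of Hadoop: fail at the point of cause
// instead of letting a later file creation hit FileNotFoundException.
final class LogDirSketch {

    // Create dir and any missing parents, or throw.
    static void ensureDirectory(File dir) throws IOException {
        // mkdirs() returns false both on failure and when the directory
        // already exists, so confirm with isDirectory() before giving up.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Could not create log directory " + dir);
        }
    }

    // Grant the owner read/write/execute on dir, or throw.
    static void grantOwnerAccess(File dir) throws IOException {
        // Each setter returns false if the permission could not be changed.
        if (!dir.setReadable(true, true)
                || !dir.setWritable(true, true)
                || !dir.setExecutable(true, true)) {
            throw new IOException("Could not set permissions on " + dir);
        }
    }

    // Both steps must succeed before any file is written into dir.
    static void prepareLogDir(File dir) throws IOException {
        ensureDirectory(dir);
        grantOwnerAccess(dir);
    }

    public static void main(String[] args) throws IOException {
        // Assumed layout for the demo only; real paths come from the task.
        File base = Files.createTempDirectory("userlogs").toFile();
        File attemptDir = new File(base, "job_0001/attempt_0001");
        prepareLogDir(attemptDir);
        System.out.println("log dir ready: " + attemptDir);
    }
}
```

Checking the return values does not remove the NFS race itself; under these assumptions it only makes the failure show up where the directory is prepared rather than later in writeJobACLs().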