Date: Tue, 6 Jun 2006 03:12:31 +0000 (GMT+00:00)
From: "Owen O'Malley (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-277) Race condition in Configuration.getLocalPath()
Message-ID: <15700247.1149563551922.JavaMail.jira@brutus>
In-Reply-To: <674337.1149556349857.JavaMail.jira@brutus>

     [ http://issues.apache.org/jira/browse/HADOOP-277?page=all ]

Owen O'Malley updated HADOOP-277:
---------------------------------

    Attachment: mkdirs.patch

This patch is closer to what we did for
the routine above it last week. (Sorry about not fixing this one too at the same time. It wasn't biting us, but that was no reason not to fix the obviously parallel code.)

Is there some reason that you need the synchronized block around the mkdirs? File.mkdirs does a File.exists internally, so you don't need to call it yourself.

> Race condition in Configuration.getLocalPath()
> ----------------------------------------------
>
>          Key: HADOOP-277
>          URL: http://issues.apache.org/jira/browse/HADOOP-277
>      Project: Hadoop
>         Type: Bug
>  Environment: linux, 64 bit, dual core, 4x400GB disk, 4GB RAM
>     Reporter: paul sutter
>  Attachments: hadoop-277.patch, hadoop-task_1_r_9.log, mkdirs.patch
>
> (attached: a patch to fix the problem, and a logfile showing the problem occurring twice)
> There is a race condition in Configuration.java:
>        Path file = new Path(dirs[index], path);
>        Path dir = file.getParent();
>        if (fs.exists(dir) || fs.mkdirs(dir)) {
>          return file;
> If two threads simultaneously execute this code with the same target directory, fs.exists() will return false for both, but fs.mkdirs() will return true for only one of them. From the Java documentation:
> "returns: true if and only if the directory was created, along with all necessary parent directories; false otherwise"
> That is, if the first thread successfully creates the directory, the second will not, and its call will therefore return false, even though the directory exists.
> This was really happening. We use four temporary directories, and we had reducers failing all over the place with bizarre impossible errors. I modified the ReduceTaskRunner to output the filename that it creates to find the problem, and the log output is below.
> Here you can see copies initiated simultaneously for two files that hash to the same temp directory. map_4.out is created in the correct directory (/data2...), but map_15.out is created in the next directory (/data3...) because of this race condition.
> Minutes later, when the appender tries to locate the file, that race condition does not occur (the directory already exists), and the appender looks for the file map_15.out in the correct directory, where it does not exist.
> 060605 142414 task_0001_r_000009_1 Copying task_0001_m_000004_0 output from rmr05.
> 060605 142414 task_0001_r_000009_1 Copying task_0001_m_000015_0 output from rmr04.
> ...
> 060605 142416 task_0001_r_000009_1 done copying task_0001_m_000004_0 output from rmr05 into /data2/tmp/mapred/local/task_0001_r_000009_1/map_4.out
> ...
> 060605 142418 task_0001_r_000009_1 done copying task_0001_m_000015_0 output from rmr04 into /data3/tmp/mapred/local/task_0001_r_000009_1/map_15.out
> ...
> 060605 142531 task_0001_r_000009_1 0.31808624% reduce > append > /data2/tmp/mapred/local/task_0001_r_000009_1/map_4.out
> ...
> 060605 142725 task_0001_r_000009_1 java.io.FileNotFoundException: /data2/tmp/mapred/local/task_0001_r_000009_1/map_15.out

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
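
For readers of the archive, the race described in the quoted report can be sketched with plain java.io.File. This is only an illustration under stated assumptions: it is not the contents of mkdirs.patch or hadoop-277.patch, it uses java.io.File rather than the Hadoop FileSystem object from the report, and the class name EnsureDir and method name ensureDirectory are hypothetical. The idea is to stop treating a false return from mkdirs() as failure and instead check afterwards whether the directory is there.

```java
import java.io.File;
import java.nio.file.Files;

// Hypothetical illustration; not taken from either attached patch.
class EnsureDir {

    /**
     * Returns true if dir exists as a directory when this call returns,
     * no matter which thread or process actually created it.
     */
    public static boolean ensureDirectory(File dir) {
        // mkdirs() returns false when the directory already exists, so its
        // result alone cannot tell "another thread created it first" apart
        // from "creation failed". Checking isDirectory() afterwards treats
        // both the winner and the losers of the race as success.
        return dir.mkdirs() || dir.isDirectory();
    }

    public static void main(String[] args) throws Exception {
        File base = Files.createTempDirectory("hadoop277").toFile();
        File target = new File(base, "task/local");

        int threads = 8;
        boolean[] ok = new boolean[threads];
        Thread[] workers = new Thread[threads];

        // Several threads ask for the same directory at once, as the
        // two copier threads did in the log excerpt.
        for (int i = 0; i < threads; i++) {
            final int id = i;
            workers[i] = new Thread(() -> ok[id] = ensureDirectory(target));
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }

        for (int i = 0; i < threads; i++) {
            if (!ok[i]) {
                throw new AssertionError("thread " + i + " saw a failure");
            }
        }
        System.out.println("all " + threads + " threads see " + target);
    }
}
```

With this shape no synchronized block is needed around the call, since a thread that loses the race still observes the directory afterwards. A path that exists as a regular file still yields false, which is the genuine failure case.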