hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-277) Race condition in Configuration.getLocalPath()
Date Tue, 06 Jun 2006 16:46:31 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-277?page=comments#action_12414990 ] 

Owen O'Malley commented on HADOOP-277:

File.mkdirs (based on what I see in Eclipse) looks like:

    public boolean mkdirs() {
        if (exists()) {
            return false;
        }
        if (mkdir()) {
            return true;
        }
        ... <handle recursive mkdirs>...
    }

In any case, the final mkdir would need to be the last thing done. Without the sync block,
I believe your code is functionally identical to my proposal of:

if (fs.mkdirs(dir) || fs.exists(dir)) {
   return file;
}

Or am I missing something? If we need to synchronize, we really need to do it everywhere and
do it consistently.
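
For illustration, the mkdirs-then-check ordering can be sketched with plain java.io.File.
This is only a sketch under stated assumptions, not the Hadoop code: the names LocalDirs and
ensureDirectory are hypothetical, and it calls isDirectory() where the proposal above calls
fs.exists().

```java
import java.io.File;

final class LocalDirs {
    private LocalDirs() {}

    /**
     * Hypothetical helper: returns true if dir is a directory once the call
     * completes, whether this thread or a concurrent one created it.
     */
    static boolean ensureDirectory(File dir) {
        // mkdirs() returns false when the directory already exists, so a
        // thread that loses the creation race falls through to the second
        // check. That check has to come after the mkdirs() attempt.
        return dir.mkdirs() || dir.isDirectory();
    }
}
```

With this ordering, two callers racing on the same directory should both see true, since the
loser's check runs only after its own mkdirs() attempt has returned.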

On a side note, the Configuration's getFile roll-over between local directories is problematic.
The problem is that readers need to find the file regardless of where it was written. So if
the writer can spill over to other directories, there should be a findFile(?) that looks in
all of the directories (in the right order) until it finds it. That way readers can find the
file regardless of which directory the writer spilled into.
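
A rough sketch of what such a lookup might look like, again with plain java.io.File. Everything
here is an assumption for illustration: the class name LocalFileFinder, the signature of
findFile, and the idea that "the right order" means starting at the directory the writer would
have tried first (startIndex) and wrapping around.

```java
import java.io.File;
import java.util.List;
import java.util.Optional;

final class LocalFileFinder {
    private LocalFileFinder() {}

    /**
     * Hypothetical lookup: probes each local directory for path, beginning
     * at startIndex and wrapping around, and returns the first regular file
     * found, or an empty Optional if no directory contains it.
     */
    static Optional<File> findFile(List<File> dirs, int startIndex, String path) {
        for (int i = 0; i < dirs.size(); i++) {
            // floorMod keeps the index in range even for a negative startIndex.
            File dir = dirs.get(Math.floorMod(startIndex + i, dirs.size()));
            File candidate = new File(dir, path);
            if (candidate.isFile()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

A reader that uses the same starting index as the writer would then find the file even when the
writer rolled over to a later directory.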

> Race condition in Configuration.getLocalPath()
> ----------------------------------------------
>          Key: HADOOP-277
>          URL: http://issues.apache.org/jira/browse/HADOOP-277
>      Project: Hadoop
>         Type: Bug

>  Environment: linux, 64 bit, dual core, 4x400GB disk, 4GB RAM
>     Reporter: paul sutter
>  Attachments: hadoop-277.patch, hadoop-task_1_r_9.log, mkdirs.patch
> (attached: a patch to fix the problem, and a logfile showing the problem occurring twice)
> There is a race condition in Configuration.java:
>        Path file = new Path(dirs[index], path);
>        Path dir = file.getParent();
>        if (fs.exists(dir) || fs.mkdirs(dir)) {
>          return file;
>        }
> If two threads simultaneously process this code with the same target directory, fs.exists()
> will return false for both, but fs.mkdirs() will return true for only one of them. From
> the Java documentation:
>  "returns: true if and only if the directory was created, along with all necessary parent
> directories; false otherwise"
> That is, if the first thread successfully creates the directory, the second will not,
> and its fs.mkdirs() call will therefore return false, even though the directory exists.
> This was really happening. We use four temporary directories, and we had reducers failing
> all over the place with bizarre, impossible errors. I modified the ReduceTaskRunner to output
> the filename that it creates to find the problem, and the log output is below.
> Here you can see copies initiated simultaneously for two files that hash to the same temp
> directory. map_4.out is created in the correct directory (/data2...), but map_15.out
> is created in the next directory (/data3...) because of this race condition. Minutes later,
> when the appender tries to locate the file, the race condition does not occur (the directory
> already exists), and the appender looks for the file map_15.out in the correct directory,
> where it does not exist.
> 060605 142414 task_0001_r_000009_1 Copying task_0001_m_000004_0 output from rmr05.
> 060605 142414 task_0001_r_000009_1 Copying task_0001_m_000015_0 output from rmr04.
> ...
> 060605 142416 task_0001_r_000009_1 done copying task_0001_m_000004_0 output from rmr05 into /data2/tmp/mapred/local/task_0001_r_000009_1/map_4.out
> ...
> 060605 142418 task_0001_r_000009_1 done copying task_0001_m_000015_0 output from rmr04 into /data3/tmp/mapred/local/task_0001_r_000009_1/map_15.out
> ...
> 060605 142531 task_0001_r_000009_1 0.31808624% reduce > append > /data2/tmp/mapred/local/task_0001_r_000009_1/map_4.out
> ...
> 060605 142725 task_0001_r_000009_1 java.io.FileNotFoundException: /data2/tmp/mapred/local/task_0001_r_000009_1/map_15.out

This message is automatically generated by JIRA.