hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-5146) LocalDirAllocator misses files on the local filesystem
Date Thu, 26 Feb 2009 11:41:01 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-5146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-5146:

    Attachment: 5146.patch

After some analysis, I found the cause of the race condition:
1) Assume a fresh HDFS cluster with no files. Run a job (foo) that generates a file in
HDFS called partition.lst.
2) Run a job (bar) that uses the file foo generated. On a given node, one task localizes
partition.lst in the distributed cache and the other tasks simply use that localized copy.
3) The bar job finishes successfully without any task failures.
4) Now run foo again. This regenerates partition.lst at the same location.
5) Run bar again. On a node that was used by the previous bar job, a task t1 from the
new bar job still finds partition.lst in the cache in the ifExists() check. Context now
switches to another TaskRunner thread, say t2 of the new bar job.
6) t2 also finds that ifExists() returns true, but when it calls getLocalCache it finds the
file to be stale (since the foo job regenerated it) and deletes it in
DistributedCache.localizeCache. Context now switches back to t1.
7) t1 calls getLocalPathToRead and doesn't find the file. For t1, ifExists() returned true,
but getLocalPathToRead fails for the same path with the DiskErrorException shown in the
quoted trace. This is the race condition.
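
The interleaving in steps 5 to 7 can be replayed deterministically. The following is only
a minimal sketch, not Hadoop code: the class CheckThenActRace and its methods are
hypothetical stand-ins that borrow the names ifExists and getLocalPathToRead, and a plain
in-memory set plays the role of the local cache directory. The thread switches are modelled
by simply running the calls in the order described above.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory model of the local cache; NOT the real LocalDirAllocator API.
class CheckThenActRace {
    private final Set<String> localFiles = new HashSet<>();

    // Stand-in for the existence check done before reading.
    boolean ifExists(String path) {
        return localFiles.contains(path);
    }

    // Stand-in for the read lookup; fails if the file has disappeared.
    String getLocalPathToRead(String path) {
        if (!localFiles.contains(path)) {
            throw new IllegalStateException("Could not find " + path);
        }
        return "/local/" + path;
    }

    // Models a task deleting a copy it found to be stale.
    void deleteStale(String path) {
        localFiles.remove(path);
    }

    void localize(String path) {
        localFiles.add(path);
    }

    // Replays steps 5-7 in a fixed order; returns true if t1's read fails
    // even though t1's earlier existence check succeeded.
    static boolean replay() {
        CheckThenActRace cache = new CheckThenActRace();
        String path = "partition.lst";
        cache.localize(path);                      // copy left by the first bar job

        boolean t1Saw = cache.ifExists(path);      // step 5: t1 checks, then is preempted
        boolean t2Saw = cache.ifExists(path);      // step 6: t2 checks
        cache.deleteStale(path);                   // step 6: t2 finds it stale, deletes it
        if (!t1Saw || !t2Saw) {
            return false;
        }
        try {
            cache.getLocalPathToRead(path);        // step 7: t1 resumes and reads
            return false;
        } catch (IllegalStateException e) {
            return true;                           // check passed, read failed
        }
    }

    public static void main(String[] args) {
        System.out.println("t1 read failed after a successful check: " + replay());
    }
}
```

The point of the sketch is that the check and the read are two separate steps with no
lock held across them, so any deletion that lands in between invalidates the check.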

The attached patch removes the call to ifExists/getLocalPathToRead in the TaskRunner thread
during cache localization. The thread now always calls getLocalPathForWrite. If the file is
already localized, the path returned by getLocalPathForWrite is not used; getLocalCache
returns the already-localized path instead.
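
The shape of that change can be sketched as follows. This is an illustration under stated
assumptions, not the contents of 5146.patch: the class LocalizeWithoutPrecheck, the
timestamp-based staleness test, and the method signatures are all hypothetical, and only
the method names getLocalPathForWrite and getLocalCache are taken from the description
above. The caller no longer checks for existence first; it always asks for a candidate
write path and lets a single synchronized method decide whether to reuse the existing copy.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the patched flow; signatures are assumptions, not Hadoop's.
class LocalizeWithoutPrecheck {
    private static final class Entry {
        final String localPath;
        final long timestamp;

        Entry(String localPath, long timestamp) {
            this.localPath = localPath;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Entry> localized = new HashMap<>();
    private int nextDir = 0;

    // Always hands out a fresh candidate location; never checks for an existing copy.
    synchronized String getLocalPathForWrite(String name) {
        return "/local" + (nextDir++) + "/" + name;
    }

    // The staleness check and the replacement happen under one lock, so no other
    // thread can observe the file between "exists" and "deleted".
    synchronized String getLocalCache(String name, long sourceTimestamp, String candidatePath) {
        Entry existing = localized.get(name);
        if (existing != null && existing.timestamp == sourceTimestamp) {
            return existing.localPath;             // already localized: candidate is unused
        }
        localized.put(name, new Entry(candidatePath, sourceTimestamp));
        return candidatePath;                      // missing or stale: use the candidate
    }

    // What each task thread would do in this model.
    String localize(String name, long sourceTimestamp) {
        String candidate = getLocalPathForWrite(name);
        return getLocalCache(name, sourceTimestamp, candidate);
    }
}
```

In this model, two calls to localize with the same timestamp return the same path, and a
call with a newer timestamp returns a different one, without any separate existence check
that could go stale.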

> LocalDirAllocator misses files on the local filesystem
> ------------------------------------------------------
>                 Key: HADOOP-5146
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5146
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: Arun C Murthy
>            Assignee: Devaraj Das
>            Priority: Blocker
>             Fix For: 0.20.0
>         Attachments: 5146.patch, 5146.patch, 5146_20090204job.output.txt, localdirallocator.patch,
> For some reason LocalDirAllocator.getLocalPathToRead doesn't find files which are present; extra logging shows:
> {noformat}
> 2009-01-30 06:43:32,312 INFO org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: in ifExists, /grid/2/arunc/mapred-local/taskTracker/archive/xxx.yyy.com/tera/in/_partition.lst
> 2009-01-30 06:43:32,389 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: in getLocalPathToRead, taskTracker/archive/xxx.yyy.com/tera/in/_partition.lst doesn't exist
> 2009-01-30 06:43:32,390 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200901300512_0007_m_000055_0 Child Error
>  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/archive/xx.yyy.com/tera/in/_partition.lst in any of the configured local directories
>          at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:388)
>          at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
>          at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)
> {noformat}

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
