hadoop-mapreduce-issues mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1098) Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
Date Thu, 22 Oct 2009 10:14:59 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated MAPREDUCE-1098:
-------------------------------------

    Status: Open  (was: Patch Available)

This patch has a corner-case synchronization bug: it relies on the CacheStatus.markForDeletion
flag, but getLocalCache and deleteCache can be looking at *different* CacheStatus objects.
That renders the checks based on CacheStatus.markForDeletion useless, so the two methods
could silently corrupt the distributed cache.
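
A minimal sketch of the failure mode described above, not the actual DistributedCache code. The CacheStatus class here is a hypothetical, stripped-down stand-in with only the two fields named in this comment, and the path string is made up. It only shows why a flag set on one object cannot protect a path that a second object also points at.

```java
public class SharedPathFlagSketch {

    /** Hypothetical stand-in for CacheStatus; the real class has more state. */
    static final class CacheStatus {
        final String localLoadPath;   // local path of the localized file
        boolean markForDeletion;      // flag the patch relies on

        CacheStatus(String localLoadPath) {
            this.localLoadPath = localLoadPath;
        }
    }

    /**
     * Returns true when a getLocalCache-style check passes even though
     * a deleteCache-style caller has already flagged the same local path
     * through a different CacheStatus object.
     */
    static boolean checkIsBypassed() {
        // Assumption: two distinct objects end up describing the same local path.
        CacheStatus seenByDelete = new CacheStatus("/local/cache/example-file");
        CacheStatus seenByGet = new CacheStatus("/local/cache/example-file");

        // The deleting side flags its own object only.
        seenByDelete.markForDeletion = true;

        boolean samePath = seenByDelete.localLoadPath.equals(seenByGet.localLoadPath);
        // The localizing side inspects its own object, whose flag is still false.
        return samePath && !seenByGet.markForDeletion;
    }

    public static void main(String[] args) {
        System.out.println("check bypassed: " + checkIsBypassed());
    }
}
```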

----

The above problems arise because the DistributedCache is currently structured to share the same
underlying local file-system path across all CacheStatus objects. Effectively there is a 1-1
mapping between files on HDFS and their localized counterparts.

I think a slightly different solution to the problem exhibited by this patch is to break
the 1-1 mapping between files on HDFS and the localized files, and to have each CacheStatus
object own a unique localized path. The proposal is to have a unique CacheStatus.localLoadPath
per object and to initialize each one by copying the source file from HDFS to its own unique
localized file. We can then keep the current (correct) structure for deleteCache and
put the smarts in getLocalCache to copy on init of CacheStatus.
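
The proposal above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch: CacheStatus is a hypothetical simplified class, the local file system (java.nio.file) stands in for the copy from HDFS, and using a random UUID directory to make localLoadPath unique is just one possible choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

public class UniqueLocalPathSketch {

    /** Hypothetical, simplified CacheStatus: each instance owns a unique local path. */
    static final class CacheStatus {
        final Path localLoadPath;

        // Copy on init: the source is copied to a path no other CacheStatus uses.
        CacheStatus(Path baseDir, Path src) {
            this.localLoadPath = copyToUniquePath(baseDir, src);
        }

        // Deleting touches only this object's own copy and its private directory.
        void delete() {
            try {
                Files.deleteIfExists(localLoadPath);
                Files.deleteIfExists(localLoadPath.getParent());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Assumption: a random UUID subdirectory under baseDir is unique enough for the sketch.
    static Path copyToUniquePath(Path baseDir, Path src) {
        try {
            Path uniqueDir = Files.createDirectories(baseDir.resolve(UUID.randomUUID().toString()));
            Path target = uniqueDir.resolve(src.getFileName());
            return Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Localizes the same source twice, deletes one copy, and checks the other survives. */
    static boolean siblingSurvivesDelete() {
        try {
            Path src = Files.createTempFile("src", ".txt");
            Files.write(src, "cached".getBytes(StandardCharsets.UTF_8));
            Path baseDir = Files.createTempDirectory("cache");

            CacheStatus a = new CacheStatus(baseDir, src);
            CacheStatus b = new CacheStatus(baseDir, src);
            a.delete();

            return !Files.exists(a.localLoadPath) && Files.exists(b.localLoadPath);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sibling copy survives delete: " + siblingSurvivesDelete());
    }
}
```

The trade-off in this sketch is extra copies and disk use in exchange for deletion that cannot affect a path another object is still using.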


> Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1098
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1098
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>            Reporter: Sreekanth Ramakrishnan
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.21.0
>
>         Attachments: patch-1098-0.20.txt, patch-1098-1.txt, patch-1098-2.txt, patch-1098-ydist.txt, patch-1098.txt
>
>
> Currently {{org.apache.hadoop.filecache.DistributedCache.getLocalCache(URI, Configuration,
> Path, FileStatus, boolean, long, Path, boolean)}} allows only one {{TaskRunner}} thread in
> the TT to localize {{DistributedCache}} across jobs. The current synchronization is across
> baseDir; this has to be changed to lock on the same baseDir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

