hadoop-mapreduce-issues mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1098) Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
Date Thu, 22 Oct 2009 16:00:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12768731#action_12768731 ]

Hemanth Yamijala commented on MAPREDUCE-1098:
---------------------------------------------

Arun, this may not work either.

Basically, the localization code is like this:

{code}
synchronized (cachedArchives) {
  get lcacheStatus
  synchronized (lcacheStatus) {
    increment reference count
  }
}
synchronized (lcacheStatus) {
  localize cache
}
{code}

The delete cache code is like this:

{code}
synchronized (cachedArchives) {
  for each lcacheStatus {
    synchronized (lcacheStatus) {
      if (lcacheStatus.refCount == 0) {
        //
      }
    }
  }
}
{code}
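
Putting the two paths together, a rough Java sketch of this locking structure might look like the following. This is only an illustration of the pattern described above; the class, field and method names are made up and are not the actual DistributedCache code:

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only, not the real DistributedCache.
class CacheManagerSketch {
  static class CacheStatus {
    int refCount;       // guarded by the CacheStatus monitor
    boolean localized;  // guarded by the CacheStatus monitor
  }

  private final Map<String, CacheStatus> cachedArchives = new HashMap<String, CacheStatus>();

  void getLocalCache(String key) {
    CacheStatus lcacheStatus;
    synchronized (cachedArchives) {            // global lock: look up (or create) the entry
      lcacheStatus = cachedArchives.get(key);
      if (lcacheStatus == null) {
        lcacheStatus = new CacheStatus();
        cachedArchives.put(key, lcacheStatus);
      }
      synchronized (lcacheStatus) {            // per-entry lock: bump the reference count
        lcacheStatus.refCount++;
      }
    }
    synchronized (lcacheStatus) {              // per-entry lock held across the slow DFS download
      if (!lcacheStatus.localized) {
        // ... download from DFS; this can take a long time ...
        lcacheStatus.localized = true;
      }
    }
  }

  void deleteCache() {
    synchronized (cachedArchives) {            // global lock held for the entire sweep
      for (CacheStatus lcacheStatus : cachedArchives.values()) {
        synchronized (lcacheStatus) {          // blocks if a localizer holds this entry's lock
          if (lcacheStatus.refCount == 0) {
            // ... delete the localized files ...
          }
        }
      }
    }
  }
}
{code}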

The problem is that while iterating to delete, if a localizing thread is localizing a cache file for a particular cache object, the delete thread will wait to acquire the lock on that cache object *after* acquiring the global lock. Since the localization could take a long time, other threads will be blocked on the global lock, in effect not solving the problem we are trying to solve.

Does this make sense?

It seems like a truly correct approach should *not* require holding a lock on any object while doing a costly operation like a DFS download. Other threads should instead wait for a download-complete event notification, or some such. But those are sweeping changes.

One solution Amarsri and I discussed was to see whether making the reference count an AtomicInteger would help. Then its value can be read without having to acquire a lock on the cache status object, so the delete code would look something like this:

{code}
synchronized (cachedArchives) {
  for each lcacheStatus {
    if (lcacheStatus.atomicReferenceCount.get() == 0) {
      synchronized (lcacheStatus) {
        // continue operation as in your patch.
      }
    }
  }
}
{code}
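
For illustration, a minimal Java sketch of that delete sweep, assuming the reference count in the cache status object becomes an AtomicInteger (names are again illustrative, not the actual code):

{code}
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only.
class CacheStatusSketch {
  final AtomicInteger refCount = new AtomicInteger(0);
}

class DeleteSweepSketch {
  void deleteCache(Map<String, CacheStatusSketch> cachedArchives) {
    synchronized (cachedArchives) {
      for (CacheStatusSketch lcacheStatus : cachedArchives.values()) {
        // Read the count without taking the per-entry lock. An entry that is
        // currently being localized is guaranteed to have refCount > 0, so it
        // is skipped and the sweep never blocks behind a slow download.
        if (lcacheStatus.refCount.get() == 0) {
          synchronized (lcacheStatus) {
            // ... delete the localized files, as in the patch ...
          }
        }
      }
    }
  }
}
{code}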

Since we are guaranteed that code that's localizing a path holds a non-zero reference count, the delete thread will never proceed to the delete operation for that entry, and so never blocks behind a localization in progress.

Could this work?

> Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1098
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1098
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: tasktracker
>            Reporter: Sreekanth Ramakrishnan
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.21.0
>
>         Attachments: MAPREDUCE-1098.patch, patch-1098-0.20.txt, patch-1098-1.txt, patch-1098-2.txt, patch-1098-ydist.txt, patch-1098.txt
>
>
> Currently {{org.apache.hadoop.filecache.DistributedCache.getLocalCache(URI, Configuration, Path, FileStatus, boolean, long, Path, boolean)}} allows only one {{TaskRunner}} thread in TT to localize {{DistributedCache}} across jobs. Current way of synchronization is across baseDir; this has to be changed to lock on the same baseDir.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

