hadoop-mapreduce-issues mailing list archives

From "Bhallamudi Venkata Siva Kamesh (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
Date Thu, 17 Feb 2011 15:19:24 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995845#comment-12995845 ]

Bhallamudi Venkata Siva Kamesh commented on MAPREDUCE-1213:
-----------------------------------------------------------

While analyzing the patch, I found an issue. The moveAndDelete method below is called from
both the JobTracker and the TaskTracker. The JobTracker calls it on its JobTracker folder
and the TaskTracker on its TaskTracker folder (e.g. /home/hadoop/tasktracker/local). The method
renames the current folder and deletes it asynchronously. If the deletion step
fails for some reason (for example, the process is killed abruptly), the renamed folders
are never deleted by anyone.




{code:title=MRAsyncDiskService.java|borderStyle=solid}

  public boolean moveAndDelete(String volume, String pathName) throws IOException {
    // Move the file right now, so that it can be deleted later
    String newPathName;
    synchronized (this) {
      newPathName = format.format(new Date()) + "_" + uniqueId;
      uniqueId ++;
    }
    newPathName = SUBDIR + Path.SEPARATOR_CHAR + newPathName;

    Path source = new Path(volume, pathName);
    Path target = new Path(volume, newPathName);
    try {
      if (!localFileSystem.rename(source, target)) {
        return false;
      }
    } catch (FileNotFoundException e) {
      // Return false in case that the file is not found.
      return false;
    }
    DeleteTask task = new DeleteTask(volume, pathName, newPathName);
    execute(volume, task);
    return true;
  }
{code}
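
One possible mitigation for the leftover-folder problem described above is to sweep the staging directory once at startup, before any new deletions are scheduled. The following is only a minimal sketch using plain java.nio, not the actual patch and not Hadoop's LocalFileSystem API. The class name LeftoverCleaner, the method cleanLeftovers, and the directory name "toBeDeleted" (standing in for SUBDIR) are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: removes directories that an earlier run renamed
// into the staging directory but never finished deleting.
final class LeftoverCleaner {
    // Assumed placeholder for SUBDIR in the snippet above.
    static final String STAGING_DIR = "toBeDeleted";

    /** Deletes every entry under volume/STAGING_DIR; returns how many top-level entries were removed. */
    static int cleanLeftovers(Path volume) throws IOException {
        Path staging = volume.resolve(STAGING_DIR);
        if (!Files.isDirectory(staging)) {
            return 0; // nothing was ever staged on this volume
        }
        // Snapshot the entries first so we do not delete while iterating.
        List<Path> leftovers = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(staging)) {
            for (Path entry : entries) {
                leftovers.add(entry);
            }
        }
        for (Path entry : leftovers) {
            deleteRecursively(entry);
        }
        return leftovers.size();
    }

    // Depth-first delete: files first, then each directory once it is empty.
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

In a real service the sweep would probably be handed to the same asynchronous executor rather than run inline, so that startup does not block on it; the sketch runs synchronously only to stay short.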

> TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
> ----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1213
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>            Reporter: dhruba borthakur
>            Assignee: Zheng Shao
>             Fix For: 0.21.0
>
>         Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, MAPREDUCE-1213.3.patch,
>                      MAPREDUCE-1213.4.patch, MAPREDUCE-1213.branch-0.20.2.patch, MAPREDUCE-1213.branch-0.20.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively delete all
> the files in the distributed cache. It invokes FileUtil.fullyDelete(), which is very slow.
> This means that the TaskTracker cannot join the cluster for an extended period of time (up to
> 2 hours for us). The problem is acute if the number of files in the distributed cache is a few thousand.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
