hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3491) PublicLocalizer#addResource is too slow.
Date Fri, 17 Apr 2015 09:28:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14499534#comment-14499534 ]

zhihai xu commented on YARN-3491:
---------------------------------

Hi [~jlowe], you are right. I am really sorry; all my previous guesses were wrong.
I did some profiling and found that the bottleneck is in the following code:
{code}
getInitializedLocalDirs();
getInitializedLogDirs();
{code}

More precisely, the bottleneck is in checkLocalDir, which calls getFileStatus.
I did two rounds of profiling:
1. I measured the time in PublicLocalizer#addResource:
The following code, which includes the levelDB operation, takes 1 ms.
{code}
            Path publicRootPath =
                dirsHandler.getLocalPathForWrite("." + Path.SEPARATOR
                    + ContainerLocalizer.FILECACHE,
                  ContainerLocalizer.getEstimatedSize(resource), true);
            Path publicDirDestPath =
                publicRsrc.getPathForLocalization(key, publicRootPath);
            if (!publicDirDestPath.getParent().equals(publicRootPath)) {
              DiskChecker.checkDir(new File(publicDirDestPath.toUri().getPath()));
            }
{code}

getInitializedLocalDirs and getInitializedLogDirs take 12 ms together.

And the following queue.submit code takes less than 1 ms.
{code}
            synchronized (pending) {
              pending.put(queue.submit(new FSDownload(lfs, null, conf,
                  publicDirDestPath, resource, request.getContext().getStatCache())),
                  request);
            }
{code}
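
For reference, per-section timings like the ones above can be taken by wrapping each section with System.nanoTime. This is only a minimal sketch of that kind of measurement: the SectionTimer class and its timeMillis helper are hypothetical names and are not part of the NodeManager code, and the sleep is just a stand-in for the code being measured.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the NodeManager: times one section of code.
final class SectionTimer {
  /** Runs the section once and returns the elapsed wall-clock time in milliseconds. */
  static long timeMillis(Runnable section) {
    long start = System.nanoTime();
    section.run();
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  public static void main(String[] args) {
    long elapsed = timeMillis(() -> {
      try {
        Thread.sleep(20); // stand-in for the code section being measured
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    System.out.println("section took " + elapsed + " ms");
  }
}
```

A single run like this is noisy, so repeating the measurement and comparing the numbers is a reasonable precaution.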

2. Then I measured the time in getInitializedLocalDirs and getInitializedLogDirs.
I found that checkLocalDir, which is called by getInitializedLocalDirs, is really slow.
checkLocalDir takes 14 ms. There is only one local dir in my test environment.
{code}
  synchronized private List<String> getInitializedLocalDirs() {
    List<String> dirs = dirsHandler.getLocalDirs();
    List<String> checkFailedDirs = new ArrayList<String>();
    for (String dir : dirs) {
      try {
        checkLocalDir(dir);
      } catch (YarnRuntimeException e) {
        checkFailedDirs.add(dir);
      }
    }
{code}

The log in my previous comment has more than 10 local dirs, so checkLocalDir is called
more than 10 times.
10 * 14 ms is about 140 ms, so that is where the 100+ ms delay comes from.
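
As a rough check of that estimate, the sketch below multiplies the per-directory cost by the number of local dirs, and then by the number of public resources. The class and method names are hypothetical. The 14 ms and the 10 dirs come from the measurements above, while the count of 200 resources is only an assumption for illustration. It ignores getInitializedLogDirs, which was not measured separately.

```java
// Hypothetical back-of-the-envelope model; not part of the NodeManager code.
final class LocalDirCheckCost {
  /** Delay added to one addResource call: one checkLocalDir per local dir. */
  static long perCallMillis(int localDirs, long checkLocalDirMillis) {
    return localDirs * checkLocalDirMillis;
  }

  /** Total delay when the dir check runs once for every public resource. */
  static long perContainerMillis(int localDirs, long checkLocalDirMillis,
      int publicResources) {
    return perCallMillis(localDirs, checkLocalDirMillis) * publicResources;
  }

  public static void main(String[] args) {
    // 10 dirs and 14 ms are measured values; 200 resources is an assumption.
    System.out.println(perCallMillis(10, 14) + " ms per addResource call");
    System.out.println(perContainerMillis(10, 14, 200) + " ms per container");
  }
}
```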

I attached a patch, YARN-3491.000.patch, to fix the issue. The patch calls getInitializedLocalDirs
only once for each container.
The original code calls getInitializedLocalDirs for each public resource. Each container
can have hundreds of public resources, which is the situation in my previous log.
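
The idea of running the dir check once per container instead of once per public resource can be sketched as follows. This is only an illustration under my own assumptions: the class, field, and method names are hypothetical and do not reproduce YARN-3491.000.patch, and checkDirs is a stand-in for getInitializedLocalDirs and getInitializedLogDirs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative and do not reproduce the patch.
final class OncePerContainerInit {
  private final Set<String> initializedContainers = new HashSet<>();
  int dirCheckRuns = 0; // counts how often the expensive check actually runs

  // Stand-in for getInitializedLocalDirs() and getInitializedLogDirs().
  private void checkDirs() {
    dirCheckRuns++;
  }

  /** Called for every public resource; the dir check runs only for the first one of a container. */
  synchronized void addResource(String containerId, String resource) {
    if (initializedContainers.add(containerId)) {
      checkDirs(); // first resource of this container: pay the cost once
    }
    // ... the download for `resource` would be submitted here ...
  }

  public static void main(String[] args) {
    OncePerContainerInit localizer = new OncePerContainerInit();
    for (String rsrc : Arrays.asList("a.jar", "b.jar", "c.jar")) {
      localizer.addResource("container_1", rsrc);
    }
    System.out.println(localizer.dirCheckRuns); // prints 1
  }
}
```

In a long-running process the set of container IDs would also need to be cleaned up when a container finishes; that part is left out of the sketch.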

[~jlowe], could you review it? Thanks.


> PublicLocalizer#addResource is too slow.
> ----------------------------------------
>
>                 Key: YARN-3491
>                 URL: https://issues.apache.org/jira/browse/YARN-3491
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Improve the public resource localization to do both FSDownload submission to the thread
> pool and completed localization handling in one thread (PublicLocalizer).
> Currently FSDownload submission to the thread pool is done in PublicLocalizer#addResource
> which is running in Dispatcher thread and completed localization handling is done in PublicLocalizer#run
> which is running in PublicLocalizer thread.
> Because PublicLocalizer#addResource is time consuming, the thread pool can't be fully
> utilized. Instead of doing public resource localization in parallel (multithreading), public
> resource localization is serialized most of the time.
> Also there are two more benefits with this change:
> 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource. Dispatcher
> thread handles most of time critical events at Node manager.
> 2. don't need synchronization on HashMap (pending).
> Because pending will be only accessed in PublicLocalizer thread.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
