hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.
Date Thu, 28 May 2015 01:53:17 GMT
zhihai xu created YARN-3727:
-------------------------------

             Summary: For better error recovery, check if the directory exists before using it for localization.
                 Key: YARN-3727
                 URL: https://issues.apache.org/jira/browse/YARN-3727
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: nodemanager
    Affects Versions: 2.7.0
            Reporter: zhihai xu
            Assignee: zhihai xu


For better error recovery, check if the directory exists before using it for localization.
We saw the following localization failure, which was caused by cache directories that already existed.
{code}
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
DEBUG: FAILED { hdfs://XXXX/XXXXX/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot
overwrite non empty destination directory /XXXX/8/yarn/nm/usercache/XXXX/filecache/21637
2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
Resource hdfs://XXXX/XXXXX/libjars/1234.jar(->/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637/1234.jar)
transitioned from DOWNLOADING to FAILED
{code}

The root cause of this failure may be a disk failure, a failed LevelDB operation in {{startResourceLocalization}}/{{finishResourceLocalization}},
or something else.

I wonder whether we can add error-recovery code that avoids the localization failure by not using
existing cache directories for localization.

The exception was thrown at {{files.rename(dst_work, destDirPath, Rename.OVERWRITE)}} in FSDownload#call.
As the following code shows, once the exception is thrown, the existing cache directory used by {{LocalizedResource}}
is deleted.
{code}
    try {
      .........
      files.rename(dst_work, destDirPath, Rename.OVERWRITE);
    } catch (Exception e) {
      try {
        files.delete(destDirPath, true);
      } catch (IOException ignore) {
      }
      throw e;
    } finally {
{code}

Since the conflicting local directory is deleted after the localization failure anyway,
I think it would be better to check whether the directory exists before using it for localization,
so that the failure is avoided in the first place.
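
To make the proposal concrete, here is a minimal sketch of the kind of check I have in mind. It is not the NodeManager code: it uses plain {{java.nio.file}} instead of the Hadoop file system APIs, and the class {{CacheDirPicker}}, the method {{nextFreeCacheDir}} and its parameters are hypothetical names chosen for illustration. The idea is only to skip candidate directories that already exist rather than fail on the later rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not part of the NodeManager.
final class CacheDirPicker {
    private CacheDirPicker() {}

    /**
     * Returns the first candidate directory under baseDir that does not
     * exist yet. Candidates are named after successive values of nextId,
     * so a stale directory left behind by an earlier failure is skipped
     * instead of being reused as a localization destination.
     */
    static Path nextFreeCacheDir(Path baseDir, AtomicLong nextId, int maxAttempts)
            throws IOException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Path candidate = baseDir.resolve(Long.toString(nextId.getAndIncrement()));
            if (!Files.exists(candidate)) {
                // Nothing is in the way, so this path is safe to hand out.
                return candidate;
            }
            // The directory already exists: try the next id.
        }
        throw new IOException("No unused cache directory under " + baseDir
                + " after " + maxAttempts + " attempts");
    }
}
```

With a stale {{21637}} directory already present and the counter at 21637, this sketch would return the {{21638}} path. Note that the check and the later use of the directory are not atomic, so the existing cleanup in the catch block would still be needed as a fallback.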



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
