Date: Wed, 30 Sep 2015 16:39:20 +0000 (UTC)
From: "Hudson (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3727) For better error recovery, check if the directory exists before using it for localization.

    [ https://issues.apache.org/jira/browse/YARN-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14937625#comment-14937625 ]

Hudson commented on YARN-3727:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #1203 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1203/])
YARN-3727.
For better error recovery, check if the directory exists (jlowe: rev 854d25b0c30fd40f640c052e79a8747741492042)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestLocalResourcesTrackerImpl.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTracker.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/TestResourceLocalizationService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/LocalResourcesTrackerImpl.java


> For better error recovery, check if the directory exists before using it for localization.
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-3727
>                 URL: https://issues.apache.org/jira/browse/YARN-3727
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>             Fix For: 2.7.2, 2.6.2
>
>         Attachments: YARN-3727.000.patch, YARN-3727.001.patch
>
>
> For better error recovery, check if the directory exists before using it for localization.
> We saw the following localization failure, which was caused by an existing cache directory.
> {code}
> 2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { hdfs://XXXX/XXXXX/libjars/1234.jar, 1431395961545, FILE, null }, Rename cannot overwrite non empty destination directory /XXXX/8/yarn/nm/usercache/XXXX/filecache/21637
> 2015-05-11 18:59:59,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://XXXX/XXXXX/libjars/1234.jar(->/XXXX/8/yarn/nm/usercache/XXXX/filecache/21637/1234.jar) transitioned from DOWNLOADING to FAILED
> {code}
> The real cause of this failure may be a disk failure, a LevelDB operation failure in {{startResourceLocalization}}/{{finishResourceLocalization}}, or something else.
> I wonder whether we can add error recovery code that avoids the localization failure by not reusing existing cache directories for localization.
> The exception happened at {{files.rename(dst_work, destDirPath, Rename.OVERWRITE)}} in FSDownload#call. Based on the following code, after the exception, the existing cache directory used by {{LocalizedResource}} will be deleted.
> {code}
>     try {
>      .........
>       files.rename(dst_work, destDirPath, Rename.OVERWRITE);
>     } catch (Exception e) {
>       try {
>         files.delete(destDirPath, true);
>       } catch (IOException ignore) {
>       }
>       throw e;
>     } finally {
> {code}
> Since the conflicting local directory is deleted only after the localization has already failed,
> I think it would be better to check whether the directory exists before using it for localization, so that the failure is avoided in the first place.
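
The check proposed in the description could look roughly like the sketch below. This is not the YARN-3727 patch: it uses plain java.nio.file instead of Hadoop's FileContext, and the class name LocalizationDirSketch, its counter, and the method nextFreeDir are hypothetical names chosen only for illustration. The idea is to keep advancing the unique directory number until a path is found that does not exist yet, so that a leftover directory such as filecache/21637 is never handed to a download.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; all names here are assumptions, not Hadoop APIs.
final class LocalizationDirSketch {
  // Hypothetical stand-in for the counter that numbers cache directories.
  private final AtomicLong uniqueNumber;
  private final Path cacheRoot;

  LocalizationDirSketch(Path cacheRoot, long lastUsedNumber) {
    this.cacheRoot = cacheRoot;
    this.uniqueNumber = new AtomicLong(lastUsedNumber);
  }

  // Returns the next numbered path under cacheRoot that does not exist yet.
  Path nextFreeDir() {
    while (true) {
      Path candidate =
          cacheRoot.resolve(Long.toString(uniqueNumber.incrementAndGet()));
      if (!Files.exists(candidate)) {
        return candidate;
      }
      // A leftover directory is in the way: skip it and try the next number.
      // Real code would probably also log this and schedule a cleanup.
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("filecache");
    // Simulate a stale cache directory left behind by an earlier failure.
    Files.createDirectory(root.resolve("21637"));

    LocalizationDirSketch picker = new LocalizationDirSketch(root, 21636);
    System.out.println(picker.nextFreeDir().getFileName()); // prints 21638
  }
}
```

Skipping the stale directory keeps the sketch simple. Whether the leftover directory should also be deleted at that point, and by which component, is a design choice the sketch does not try to settle.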