Date: Fri, 19 Sep 2014 02:36:33 +0000 (UTC)
From: "zhihai xu (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.

    [ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14139894#comment-14139894 ]

zhihai xu commented on YARN-2566:
---------------------------------

I attached a patch, YARN-2566.000.patch, for review. The patch includes a test case that needs to mock the FileContext class, so I need to remove the final modifier from the FileContext class.

> IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-2566
>                 URL: https://issues.apache.org/jira/browse/YARN-2566
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2566.000.patch
>
>
> startLocalizer in DefaultContainerExecutor uses only the first localDir to copy the token file. If that copy fails because the first localDir does not have enough disk space, localization fails even when the other localDirs have plenty of free space. We see the following error in this case:
> {code}
> 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
> java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
>         at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
>         at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
>         at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
>         at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
>         at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
>         at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>         at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
>         at
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
> java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist
>         at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
>         at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
>         at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
>         at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:344)
>         at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
>         at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
>         at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
>         at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
>         at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>         at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
>         at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
>         at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
> 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001
transitioned from LOCALIZING to LOCALIZATION_FAILED
> 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_000001 transitioned from LOCALIZATION_FAILED to DONE
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1410663092546_0004_01_000001 from application application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1410663092546_0004_01_000001 for log-aggregation
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1410663092546_0004
> 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false for path: [/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,188 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001
> 2014-09-13 23:33:25,188 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: delete returned false
for path: [/hadoop/d2/usercache/cloudera/appcache/application_1410663092546_0004/container_1410663092546_0004_01_000001]
> 2014-09-13 23:33:25,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1410663092546_0004_01_000001
> 2014-09-13 23:33:26,159 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed container container_1410663092546_0004_01_000001
> {code}
> The correct behavior is: if an IOException happens during the copy, try the next localDir; only if the copy fails for all localDirs, throw an exception.
> I will create a patch to fix this issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
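
The fallback described in the issue (try the next localDir when the copy fails, and throw only when every localDir has failed) can be sketched roughly as below. This is an illustration only and not the attached patch: it uses plain java.nio.file instead of Hadoop's FileContext, and the class name TokenCopyFallback, the method copyToFirstUsableDir, and its parameters are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Hypothetical helper; not part of Hadoop or of YARN-2566.000.patch.
public class TokenCopyFallback {

    /**
     * Tries each local directory in order and returns the path of the copy
     * made in the first directory that accepts it.
     *
     * @param tokenFile      file to copy (stands in for the token file)
     * @param localDirs      candidate local directories, in preference order
     * @param relativeAppDir sub-path to create under the chosen local directory
     * @throws IOException if the copy failed in every local directory
     */
    static Path copyToFirstUsableDir(Path tokenFile, List<Path> localDirs,
                                     String relativeAppDir) throws IOException {
        IOException lastFailure = null;
        for (Path localDir : localDirs) {
            try {
                Path appDir = localDir.resolve(relativeAppDir);
                // Directory creation can fail too (e.g. disk full), so it
                // sits inside the same try block as the copy.
                Files.createDirectories(appDir);
                Path target = appDir.resolve(tokenFile.getFileName());
                return Files.copy(tokenFile, target,
                        StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                // Remember the failure and fall through to the next localDir.
                lastFailure = e;
            }
        }
        // Every localDir failed (or the list was empty): report it once.
        throw new IOException("Could not copy " + tokenFile
                + " to any of " + localDirs, lastFailure);
    }
}
```

With localDirs of /hadoop/d1 and /hadoop/d2 as in the log above, a failure under d1 would no longer fail localization as long as d2 can take the file. Keeping the last IOException as the cause preserves at least one underlying error for the NodeManager log.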