    [ https://issues.apache.org/jira/browse/YARN-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093909#comment-16093909 ]

Jason Lowe commented on YARN-6846:
----------------------------------

Sample log from a 2.8-based release. In this case I believe nftw is returning FTW_NS as a file type since the file was in the directory list but is no longer stat-able because it has been removed by the other container-executor. FTW_NS is not handled by the switch statement in nftw_cb and results in the "Internal error" message.

{noformat}
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1496686551678_5664018 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Stopping container container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Stopping application application_1496686551678_5664018
2017-06-30 13:43:47,397 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1496686551678_5664018 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2017-06-30 13:43:47,734 [DeletionService #3] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,746 [DeletionService #0] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /.../appcache/application_1496686551678_5664018
2017-06-30 13:43:48,990 [DeletionService #0] WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Output:
main : command provided 3
main : run as user is ...
main : requested yarn user is ...
Internal error deleting /.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
Error in nftw while deleting /.../appcache/application_1496686551678_5664018
Couldn't delete directory /.../appcache/application_1496686551678_5664018 - Directory not empty
{noformat}

The deletion code has changed in 2.9, but I believe it too will fail if files are deleted out from underneath it. Minimally we need to make the deletion more robust to errors, and it should try to delete as much of the directory tree as possible rather than giving up on the first error and leaking the rest of the tree.

> Nodemanager can fail to fully delete application local directories when applications are killed
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-6846
>                 URL: https://issues.apache.org/jira/browse/YARN-6846
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.1
>            Reporter: Jason Lowe
>            Priority: Critical
>
> When an application is killed all of the running containers are killed and the app waits for the containers to complete before cleaning up. As each container completes the container directory is deleted via the DeletionService. After all containers have completed the app completes and the app directory is deleted. If the app completes quickly enough then the deletion of the container and app directories can race against each other. If the container deletion executor deletes a file just before the application deletion executor then it can cause the application deletion executor to fail, leaving the remaining entries in the application directory lingering.
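
To illustrate the kind of robustness suggested in the comment, here is a minimal sketch of a best-effort recursive delete in C. It is not the container-executor's code and it is not the 2.9 implementation: the function name delete_tree_best_effort is hypothetical, and the sketch swaps nftw for a plain opendir/readdir walk so that each entry can be handled on its own. The two assumptions it encodes are that an entry which has already vanished (ENOENT) counts as deleted, since a concurrent deleter got there first, and that any other error is logged and counted while the walk continues.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for lstat and PATH_MAX */
#endif

#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: remove `path` and everything beneath it.
 * Returns the number of entries that could not be removed (0 means
 * the tree is gone). ENOENT is never counted as a failure, because it
 * means another process removed the entry before we did.
 */
int delete_tree_best_effort(const char *path) {
  struct stat sb;
  int failures = 0;

  if (lstat(path, &sb) != 0) {
    if (errno == ENOENT) {
      return 0;  /* listed earlier but already gone: lost the race, fine */
    }
    fprintf(stderr, "cannot stat %s: %s\n", path, strerror(errno));
    return 1;
  }

  if (!S_ISDIR(sb.st_mode)) {
    /* Regular files, symlinks, and other non-directories. */
    if (unlink(path) != 0 && errno != ENOENT) {
      fprintf(stderr, "cannot unlink %s: %s\n", path, strerror(errno));
      return 1;
    }
    return 0;
  }

  DIR *dir = opendir(path);
  if (dir == NULL) {
    if (errno == ENOENT) {
      return 0;
    }
    fprintf(stderr, "cannot open %s: %s\n", path, strerror(errno));
    return 1;
  }

  struct dirent *ent;
  while ((ent = readdir(dir)) != NULL) {
    if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0) {
      continue;
    }
    char child[PATH_MAX];
    int n = snprintf(child, sizeof(child), "%s/%s", path, ent->d_name);
    if (n < 0 || (size_t)n >= sizeof(child)) {
      fprintf(stderr, "path too long under %s\n", path);
      failures++;
      continue;  /* keep going with the remaining siblings */
    }
    /* Count failures but do not stop at the first one. */
    failures += delete_tree_best_effort(child);
  }
  closedir(dir);

  if (rmdir(path) != 0 && errno != ENOENT) {
    fprintf(stderr, "cannot rmdir %s: %s\n", path, strerror(errno));
    failures++;
  }
  return failures;
}
```

A caller would treat a non-zero return as "some entries leaked" rather than aborting partway. The sketch builds paths by string concatenation for brevity; a privileged tool would more likely want the *at() family (openat, unlinkat) to avoid following a path that is swapped underneath it.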