hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6846) Nodemanager can fail to fully delete application local directories when applications are killed
Date Wed, 19 Jul 2017 22:42:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093909#comment-16093909 ]

Jason Lowe commented on YARN-6846:
----------------------------------

Here is a sample log from a 2.8-based release.  In this case I believe nftw is reporting
FTW_NS as the file type because the file was in the directory listing but can no longer be
stat'ed: the other container-executor has already removed it.  The switch statement in
nftw_cb does not handle FTW_NS, which results in the "Internal error" message.
{noformat}
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1496686551678_5664018 transitioned from FINISHING_CONTAINERS_WAIT to APPLICATION_RESOURCES_CLEANINGUP
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Stopping container container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1496686551678_5664018
2017-06-30 13:43:47,396 [AsyncDispatcher event handler] INFO yarn.YarnShuffleService: Stopping application application_1496686551678_5664018
2017-06-30 13:43:47,397 [AsyncDispatcher event handler] INFO application.ApplicationImpl: Application application_1496686551678_5664018 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2017-06-30 13:43:47,734 [DeletionService #3] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
2017-06-30 13:43:47,746 [DeletionService #0] INFO nodemanager.LinuxContainerExecutor: Deleting absolute path : /.../appcache/application_1496686551678_5664018
2017-06-30 13:43:48,990 [DeletionService #0] WARN privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Output: 
main : command provided 3
main : run as user is ...
main : requested yarn user is ...
Internal error deleting /.../appcache/application_1496686551678_5664018/container_e03_1496686551678_5664018_01_027791
Error in nftw while deleting /.../appcache/application_1496686551678_5664018
Couldn't delete directory /.../appcache/application_1496686551678_5664018 - Directory not empty
{noformat}

The deletion code has changed in 2.9, but I believe it too will fail if files are deleted
out from underneath it.  At a minimum we need to make the deletion more robust to errors:
it should try to delete as much of the directory tree as possible rather than giving up on
the first error and leaking the rest of the tree.
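
To illustrate the kind of robustness meant here, below is a minimal sketch in C.  It is
not the actual container-executor code: the names best_effort_cb and
delete_tree_best_effort, the retry count of 3, and the descriptor limit of 16 are all
hypothetical.  It assumes that an entry which has already vanished (FTW_NS from nftw, or
ENOENT from the removal call) can be treated as deleted, and that any other per-entry
failure should be logged while the walk continues.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* expose nftw() and the FTW_* constants */
#endif
#include <assert.h>
#include <errno.h>
#include <ftw.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical callback: try to remove one entry and never abort the walk. */
static int best_effort_cb(const char *path, const struct stat *sb,
                          int type, struct FTW *ftw) {
  (void)sb;
  (void)ftw;
  (void)type;  /* every type, including FTW_NS, gets the same treatment */

  /* remove() unlinks files and symlinks and rmdir()s directories.  ENOENT
   * means another deleter got there first, so the entry is already gone. */
  if (remove(path) != 0 && errno != ENOENT) {
    fprintf(stderr, "Could not delete %s - %s\n", path, strerror(errno));
  }
  return 0;  /* returning 0 tells nftw() to keep walking */
}

/* Hypothetical entry point: returns 0 if the tree is gone afterwards,
 * -1 if something is left behind. */
int delete_tree_best_effort(const char *root) {
  assert(root != NULL);
  struct stat st;

  /* nftw() itself can give up mid-walk if a directory disappears
   * underneath it, so retry a bounded number of times. */
  for (int attempt = 0; attempt < 3; attempt++) {
    /* FTW_DEPTH: visit contents before the directory that holds them.
     * FTW_PHYS: do not follow symbolic links. */
    int rc = nftw(root, best_effort_cb, 16, FTW_DEPTH | FTW_PHYS);

    if (lstat(root, &st) != 0 && errno == ENOENT) {
      return 0;  /* nothing left, whoever deleted it */
    }
    if (rc == 0) {
      break;     /* the walk finished but entries remain; retrying won't help */
    }
  }
  return -1;
}
```

With this shape a racing deleter costs at most a repeated walk, and a failure on one entry
no longer leaks the rest of the tree.  A caller would invoke
delete_tree_best_effort(app_dir) and treat a non-zero return as "some entries remain".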

> Nodemanager can fail to fully delete application local directories when applications are killed
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-6846
>                 URL: https://issues.apache.org/jira/browse/YARN-6846
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.1
>            Reporter: Jason Lowe
>            Priority: Critical
>
> When an application is killed, all of its running containers are killed and the app waits
> for the containers to complete before cleaning up.  As each container completes, its
> container directory is deleted via the DeletionService.  After all containers have
> completed, the app completes and the app directory is deleted.  If the app completes
> quickly enough, the deletions of the container and app directories can race against each
> other.  If the container deletion executor deletes a file just before the application
> deletion executor tries to delete it, the application deletion executor can fail, leaving
> the remaining entries in the application directory lingering.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


