hadoop-yarn-issues mailing list archives

From "Haibo Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6319) race condition between deleting app dir and deleting container dir
Date Wed, 15 Mar 2017 19:21:41 GMT

    [ https://issues.apache.org/jira/browse/YARN-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926799#comment-15926799 ]

Haibo Chen commented on YARN-6319:

[~zhiguohong] Sorry for misunderstanding the issue. So this is a problem for both LCE
and DefaultContainerExecutor?
I suppose the callback only needs to be added to the last container cleanup task, which immediately
requires some special-case checking code. In addition, relying on the callback can also leave
the application dir behind. Say, for some reason, the deletion callback is not executed;
then the application folder is never cleaned up. Having
an independent and robust deletion task for the application folder will serve as a safety net,
should any container dir cleanup fail.
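
To illustrate the safety-net argument, here is a minimal Python sketch (not NodeManager code). The function names `cleanup_container` and `cleanup_app` and the callback parameter `on_done` are hypothetical, introduced only for this example. It assumes a callback-chained design in which the callback fires only after a successful container cleanup, and contrasts that with an independent app-dir task that tolerates whatever state the container cleanup left behind.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def cleanup_container(container_dir: Path,
                      on_done: Optional[Callable[[], None]] = None) -> None:
    """Hypothetical container cleanup that chains a callback.

    If the deletion raises, the callback is never invoked, so anything
    that depends on it (such as app-dir cleanup) is silently skipped.
    """
    shutil.rmtree(container_dir)  # raises if the dir is missing or undeletable
    if on_done is not None:
        on_done()


def cleanup_app(app_dir: Path) -> None:
    """Hypothetical independent app-dir cleanup.

    It does not depend on any container cleanup having succeeded and is
    idempotent: calling it on an already-removed dir is a no-op.
    """
    shutil.rmtree(app_dir, ignore_errors=True)
```

If `cleanup_container` fails, `on_done` never runs; a separately scheduled `cleanup_app` still removes the application dir, which is the safety net described above.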

> race condition between deleting app dir and deleting container dir
> ------------------------------------------------------------------
>                 Key: YARN-6319
>                 URL: https://issues.apache.org/jira/browse/YARN-6319
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hong Zhiguo
>            Assignee: Hong Zhiguo
> Last container (on one node) of one app completes
>     |    --> triggers async deletion of container dir (container cleanup)
>     |    --> triggers async deletion of app dir (app cleanup)
> For LCE, deletion is done by container-executor. The "app cleanup" lists the sub-dirs (step 1),
> and then unlinks the items one by one (step 2). If a file is deleted by "container cleanup"
> between step 1 and step 2, it reports the error below and aborts the deletion.
> {code}
> ContainerExecutor: Couldn't delete file $LOCAL/usercache/$USER/appcache/application_1481785469354_353539/container_1481785469354_353539_01_000028/$FILE - No such file or directory
> {code}
> This app dir then escapes the cleanup, and that's why we always have many app dirs left behind.
> solution 1: just ignore the error without breaking in container-executor.c::delete_path()
> solution 2: use a lock to serialize the cleanup of the same app dir.
> solution 3: back off and retry on error
> Comments are welcome.
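
The list-then-unlink race and "solution 1" from the description can be sketched in Python as follows. This is an illustration of the technique only, not the actual `container-executor.c::delete_path()` code. The function name `delete_tree` and the `between_steps` hook are hypothetical; the hook exists only so that a concurrent deletion can be simulated deterministically between step 1 and step 2.

```python
import os
from typing import Callable, Optional


def delete_tree(path: str,
                between_steps: Optional[Callable[[str], None]] = None) -> None:
    """Recursively delete `path`, treating already-missing entries as success."""
    try:
        # Step 1: take a snapshot of the directory entries.
        with os.scandir(path) as it:
            entries = list(it)
    except FileNotFoundError:
        return  # another task already removed the whole directory

    if between_steps is not None:
        between_steps(path)  # hypothetical hook: simulate a concurrent deleter

    # Step 2: unlink the items one by one.
    for entry in entries:
        try:
            if entry.is_dir(follow_symlinks=False):
                delete_tree(entry.path, between_steps)
            else:
                os.unlink(entry.path)
        except FileNotFoundError:
            # "No such file or directory": the goal (entry gone) is already
            # met, so continue instead of aborting the whole deletion.
            continue

    try:
        os.rmdir(path)
    except FileNotFoundError:
        pass  # removed concurrently
```

Only the "already gone" error is ignored here; other errors (for example permission failures) still propagate. A retry with backoff (solution 3) or a per-app lock (solution 2) could wrap the same function, at the cost of extra coordination.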
