hadoop-yarn-issues mailing list archives

From "Haibo Chen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6319) race condition between deleting app dir and deleting container dir
Date Fri, 17 Mar 2017 17:47:41 GMT

    [ https://issues.apache.org/jira/browse/YARN-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930366#comment-15930366 ]

Haibo Chen commented on YARN-6319:
----------------------------------

By linearizing container cleanup and app cleanup, I mean that application cleanup has to wait
for all container cleanups to finish before it can start, i.e., application cleanup can only
happen after the last container cleanup finishes. I do not mean that container cleanups need
to be done one after another. When deletion threads are occupied or delayed, it can
take some time for the last container cleanup task to finish. Again, I don't think this is a dependency
that we need to have. Even though we may need to change two containerExecutors
for option 1, the change should be fairly self-contained and does not affect the rest of the
flow. BTW, can you please set the Affects Version field so that we are talking about the same
version?
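
To make "linearizing" concrete, here is a minimal Python sketch of the ordering described above. It is not NodeManager code: `cleanup_app_linearized` and the `delete_dir` callback are hypothetical names introduced only for illustration. Container cleanups run concurrently, but the app cleanup is held back until the slowest one has finished.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def cleanup_app_linearized(executor, container_dirs, app_dir, delete_dir):
    """Delete all container dirs (in any order), then the app dir.

    Hypothetical helper: delete_dir is any callable that removes one directory.
    """
    # Container cleanups are submitted together and may run concurrently.
    futures = [executor.submit(delete_dir, d) for d in container_dirs]

    # Barrier: app cleanup cannot start until the last container cleanup ends,
    # so a busy or delayed deletion thread delays the app cleanup too.
    wait(futures)
    for future in futures:
        future.result()  # re-raise any exception from a container cleanup

    delete_dir(app_dir)
```

The `wait(futures)` call is the dependency in question: it removes the race, at the cost of tying the app cleanup to the slowest container cleanup.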

> race condition between deleting app dir and deleting container dir
> ------------------------------------------------------------------
>
>                 Key: YARN-6319
>                 URL: https://issues.apache.org/jira/browse/YARN-6319
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Hong Zhiguo
>            Assignee: Hong Zhiguo
>
> The last container (on one node) of an app completes
>     |    --> triggers async deletion of container dir (container cleanup)
>     |    --> triggers async deletion of app dir (app cleanup)
> For LCE, deletion is done by container-executor. The "app cleanup" lists the sub-dirs (step 1),
> and then unlinks the items one by one (step 2). If a file is deleted by "container cleanup"
> between step 1 and step 2, it reports the error below and aborts the deletion.
> {code}
> ContainerExecutor: Couldn't delete file $LOCAL/usercache/$USER/appcache/application_1481785469354_353539/container_1481785469354_353539_01_000028/$FILE - No such file or directory
> {code}
> This app dir then escapes the cleanup, which is why we always have many app dirs left behind.
> solution 1: just ignore the error without breaking in container-executor.c::delete_path()
> solution 2: use a lock to serialize the cleanup of same app dir.
> solution 3: back off and retry on error
> Comments are welcome.
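
As an illustration of solution 1, the sketch below shows a recursive delete that treats "No such file or directory" as success instead of aborting. It is written in Python for brevity and is not the actual C code in container-executor.c; `delete_path_tolerant` is a hypothetical name, and only standard `os` and `stat` calls are used.

```python
import os
import stat


def delete_path_tolerant(path):
    """Recursively delete path, treating already-missing entries as success."""
    try:
        st = os.lstat(path)
    except FileNotFoundError:
        return  # a concurrent cleanup already removed it

    if stat.S_ISDIR(st.st_mode):
        try:
            names = os.listdir(path)  # step 1: list the sub-dir
        except FileNotFoundError:
            return
        for name in names:  # step 2: remove the items one by one
            delete_path_tolerant(os.path.join(path, name))
        try:
            os.rmdir(path)
        except FileNotFoundError:
            pass  # directory vanished between step 2 and rmdir
    else:
        try:
            os.unlink(path)
        except FileNotFoundError:
            pass  # file vanished between step 1 and step 2
```

With this behavior, an entry removed by "container cleanup" between the listing and the unlink no longer stops the "app cleanup" from removing the rest of the app dir. Other errors, such as permission failures, still propagate.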



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

