hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3464) Race condition in LocalizerRunner causes container localization timeout.
Date Wed, 08 Apr 2015 18:14:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485687#comment-14485687 ]

Karthik Kambatla commented on YARN-3464:
----------------------------------------

bq. Looking at the code closely, I don't see any resources being removed from pending. So, pending shouldn't be empty after some of the resources have been downloaded.
Never mind. findNextResource calls iterator.remove(), so resources do get removed from pending.

In any case, I think the right approach is to send an explicit event to the localizer to indicate
we are done with localizing all the resources. On receiving this, the localizer tracker sends
the DIE action.
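
A minimal sketch of the explicit "done" event proposed above, using a simplified stand-in rather than the actual NodeManager classes. The class LocalizerSketch, the method localizationDone(), and the flag allResourcesRequested are hypothetical names chosen for illustration. The only point is that DIE is sent after the explicit event arrives, not merely because pending happens to be empty.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the proposal; not the real LocalizerRunner.
class LocalizerSketch {
  enum Action { LIVE, DIE }

  private final Queue<String> pending = new ArrayDeque<>();
  // Assumption: set only by an explicit "all resources localized" event.
  private boolean allResourcesRequested = false;

  // A resource request arrives for this localizer.
  synchronized void addResource(String resource) {
    pending.add(resource);
  }

  // The explicit event: no further resource requests will arrive.
  synchronized void localizationDone() {
    allResourcesRequested = true;
  }

  // Called on each localizer heartbeat; hands out the next resource, if any.
  synchronized Action heartbeat() {
    pending.poll();
    // An empty pending list alone is no longer enough to send DIE.
    return (allResourcesRequested && pending.isEmpty()) ? Action.DIE : Action.LIVE;
  }
}
```

With this shape, a heartbeat that finds pending empty keeps the localizer alive until localizationDone() has been called, so a request that arrives late is still picked up on a later heartbeat.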

> Race condition in LocalizerRunner causes container localization timeout.
> ------------------------------------------------------------------------
>
>                 Key: YARN-3464
>                 URL: https://issues.apache.org/jira/browse/YARN-3464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Race condition in LocalizerRunner causes container localization timeout.
> Currently, LocalizerRunner kills the ContainerLocalizer when the pending list of LocalizerResourceRequestEvents is empty.
> {code}
>       } else if (pending.isEmpty()) {
>         action = LocalizerAction.DIE;
>       }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the ContainerLocalizer due to an empty pending list, this LocalizerResourceRequestEvent will never be handled.
> Without ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until it is killed by the AM due to TASK_TIMEOUT.
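
The sequence described in the issue can be replayed deterministically with a simplified model. This is only a sketch: RacyRunnerSketch, localizerAlive, and isStuck() are hypothetical names, and the real LocalizerRunner#update does considerably more. The model keeps just the quoted rule that an empty pending list yields DIE.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the reported behavior; not the real NodeManager code.
class RacyRunnerSketch {
  enum Action { LIVE, DIE }

  private final Queue<String> pending = new ArrayDeque<>();
  private boolean localizerAlive = true;

  // Models a LocalizerResourceRequestEvent being added to pending.
  synchronized void addResource(String resource) {
    pending.add(resource);
  }

  // Models one heartbeat from the ContainerLocalizer.
  synchronized Action update() {
    if (!localizerAlive) {
      throw new IllegalStateException("no localizer left to heartbeat");
    }
    if (pending.poll() != null) {
      return Action.LIVE;      // a resource was handed out
    }
    localizerAlive = false;    // mirrors: else if (pending.isEmpty()) action = DIE
    return Action.DIE;
  }

  // True when a request is waiting but nothing will ever call update() again.
  synchronized boolean isStuck() {
    return !localizerAlive && !pending.isEmpty();
  }
}
```

Calling addResource("a.jar"), then update() twice, then addResource("b.jar") leaves isStuck() true: the second request sits in pending with no localizer left to fetch it.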



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
