hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3449) Recover appTokenKeepAliveMap upon nodemanager restart
Date Mon, 06 Apr 2015 15:48:12 GMT

    [ https://issues.apache.org/jira/browse/YARN-3449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14481330#comment-14481330 ]

Junping Du commented on YARN-3449:

Thanks [~jlowe] for replying with comments!
I am not entirely sure about this. However, from what I have learned from the code, it looks
like the RM side renews the delegation tokens for finishing apps, while the NM still needs them
for log aggregation. The NM keeps a token alive for log aggregation by sending appTokenKeepAliveMap
in its heartbeat to the RM and keeping the time value updated (currentTime + 0.7~0.9 * tokenRemovalDelayMs)
in every heartbeat request/response. If appTokenKeepAliveMap is not recovered after the NM
restarts, the NM will never add these apps to the keep-alive list (appsToCleanup is only sent
once by the RM), and the RM won't renew the token once the time expires (based on the last heartbeat
request before the NM restart), because it won't receive any new messages from the NM about these apps.
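
To make the mechanism above concrete, here is a minimal Java sketch of the keep-alive bookkeeping. It is not the actual NodeStatusUpdaterImpl code: the class KeepAliveSketch and its methods are hypothetical names, and only appTokenKeepAliveMap, tokenRemovalDelayMs and the 0.7~0.9 factor come from the discussion.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative sketch only; names other than appTokenKeepAliveMap and
// tokenRemovalDelayMs are assumptions, not the real NodeManager API.
class KeepAliveSketch {
    private final long tokenRemovalDelayMs;
    private final Random random;
    // appId -> next time (ms) at which a keep-alive should be sent to the RM
    private final Map<String, Long> appTokenKeepAliveMap = new HashMap<>();

    KeepAliveSketch(long tokenRemovalDelayMs, Random random) {
        this.tokenRemovalDelayMs = tokenRemovalDelayMs;
        this.random = random;
    }

    // now + a random fraction in [0.7, 0.9) of tokenRemovalDelayMs
    long nextKeepAliveTime(long now) {
        double fraction = 0.7 + 0.2 * random.nextDouble();
        return now + (long) (fraction * tokenRemovalDelayMs);
    }

    // The RM lists finished apps only once; this is the only place where
    // apps enter the map, so an empty map after restart stays empty.
    void onAppsToCleanup(Iterable<String> appIds, long now) {
        for (String appId : appIds) {
            appTokenKeepAliveMap.put(appId, nextKeepAliveTime(now));
        }
    }

    // Apps whose keep-alive is due for this heartbeat; each returned app
    // gets its next keep-alive time refreshed.
    List<String> appsToKeepAlive(long now) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> e : appTokenKeepAliveMap.entrySet()) {
            if (e.getValue() <= now) {
                due.add(e.getKey());
                e.setValue(nextKeepAliveTime(now));
            }
        }
        return due;
    }
}
```

A freshly constructed KeepAliveSketch plays the role of a restarted NM without recovery: appsToKeepAlive() returns an empty list forever, because onAppsToCleanup() is never called again for those apps.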

In practice, this issue is not obvious because tokenRemovalDelayMs is usually very
large (10 minutes by default), and there are very few cases where the NM cannot finish log
aggregation within this time (even counting the NM restart time). However, we should still fix it
because it makes the behavior of delegation token renewal inconsistent before and after an NM
restart (and causes a bug, at least in theory). Don't you agree?

> Recover appTokenKeepAliveMap upon nodemanager restart
> -----------------------------------------------------
>                 Key: YARN-3449
>                 URL: https://issues.apache.org/jira/browse/YARN-3449
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.6.0, 2.7.0
>            Reporter: Junping Du
>            Assignee: Junping Du
> appTokenKeepAliveMap in NodeStatusUpdaterImpl is used to keep an application alive after
the application has finished while the NM still needs the app token to do log aggregation (when
security and log aggregation are enabled).
> Applications are only inserted into this map when getApplicationsToCleanup() is received
in the RM heartbeat response, and the RM only sends this info once, in RMNodeImpl.updateNodeHeartbeatResponseForCleanup().
Work-preserving NM restart should put appTokenKeepAliveMap into the NMStateStore and recover it
after restart. Without this, the RM could terminate the application earlier, so log aggregation
could fail if security is enabled.
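
The proposal in the description could be sketched roughly as follows. This is only an illustration under stated assumptions: KeepAliveStore, InMemoryKeepAliveStore and RecoverableKeepAliveMap are hypothetical names, and the in-memory store merely stands in for a persistent NMStateStore; none of this is the real NodeManager state-store API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the persistent NM state store.
interface KeepAliveStore {
    void storeKeepAlive(String appId, long nextKeepAliveTimeMs);
    void removeKeepAlive(String appId);
    Map<String, Long> loadKeepAlives();
}

// In-memory implementation used only for illustration; a real store
// would have to survive the NM process being restarted.
class InMemoryKeepAliveStore implements KeepAliveStore {
    private final Map<String, Long> persisted = new HashMap<>();

    public void storeKeepAlive(String appId, long nextKeepAliveTimeMs) {
        persisted.put(appId, nextKeepAliveTimeMs);
    }

    public void removeKeepAlive(String appId) {
        persisted.remove(appId);
    }

    public Map<String, Long> loadKeepAlives() {
        return new HashMap<>(persisted);
    }
}

// Keep-alive map that writes through to the store and reloads on startup.
class RecoverableKeepAliveMap {
    private final KeepAliveStore store;
    private final Map<String, Long> appTokenKeepAliveMap = new HashMap<>();

    RecoverableKeepAliveMap(KeepAliveStore store) {
        this.store = store;
        // Recovery step: repopulate the map from whatever was persisted.
        appTokenKeepAliveMap.putAll(store.loadKeepAlives());
    }

    void put(String appId, long nextKeepAliveTimeMs) {
        store.storeKeepAlive(appId, nextKeepAliveTimeMs); // persist first
        appTokenKeepAliveMap.put(appId, nextKeepAliveTimeMs);
    }

    void remove(String appId) {
        store.removeKeepAlive(appId);
        appTokenKeepAliveMap.remove(appId);
    }

    Map<String, Long> snapshot() {
        return new HashMap<>(appTokenKeepAliveMap);
    }
}
```

Constructing a second RecoverableKeepAliveMap over the same store models the NM restart: the new instance starts with the entries the old one had persisted, so keep-alives can continue without the RM resending the cleanup list.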

This message was sent by Atlassian JIRA
