hadoop-yarn-issues mailing list archives

From "Li Lu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3359) Recover collector list in RM failed over
Date Mon, 12 Sep 2016 20:41:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-3359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15485225#comment-15485225 ]

Li Lu commented on YARN-3359:

I had an offline discussion with [~vinodkv] about this issue. We cannot simply persist
collector state in the RM state store, because this state is not final and updating it
frequently would block the RM. A natural alternative to the RM state store is the NM state
store. That is to say, we can rebuild the RM's collector table from updates sent by the NMs.
In summary, we need to do the following:

For NMs:
1. On collector launch, persist the collector address in the NM state store.
2. On collector removal, remove the related item from the state store.
3. On start-up, recover collector addresses from the state store.
4. On resync, send the current collector address mapping to the RM.
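
To make the four NM steps concrete, here is a minimal sketch. It is not taken from the YARN code base: the class name, the method names, and the use of a plain map as a stand-in for the persistent NM state store are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in YARN.
class NmCollectorStateSketch {
  // Stand-in for the persistent NM state store: app ID -> collector address.
  private final Map<String, String> stateStore;
  // In-memory view of the collectors currently known to this NM.
  private final Map<String, String> liveCollectors = new HashMap<>();

  NmCollectorStateSketch(Map<String, String> stateStore) {
    this.stateStore = stateStore;
    // Step 3: on start-up, recover collector addresses from the state store.
    liveCollectors.putAll(stateStore);
  }

  // Step 1: on collector launch, persist the collector address.
  void onCollectorLaunched(String appId, String address) {
    stateStore.put(appId, address);
    liveCollectors.put(appId, address);
  }

  // Step 2: on collector removal, remove the related item.
  void onCollectorRemoved(String appId) {
    stateStore.remove(appId);
    liveCollectors.remove(appId);
  }

  // Step 4: on resync, return a copy of the current mapping to send to the RM.
  Map<String, String> onResync() {
    return new HashMap<>(liveCollectors);
  }
}
```

Passing the same store object to a second instance simulates an NM restart: the new instance recovers whatever the previous one persisted and can report it on resync.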

For RMs, the only change needed is to rebuild the collector/address mapping upon restart.
This actually involves a pretty messy corner case: when one application has two different
attempts running (due to network problems, for example) and the RM is trying to rebuild
collector status, the RM needs to know which collector belongs to the latest app attempt and
which one belongs to the stale attempt. This requires some changes to collector IDs. Right now
each collector is mapped to an app ID, but to handle the state recovery case, we need to
associate each collector with an attempt ID (and ideally a timestamp to further distinguish
collectors).
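
One way the RM-side rebuild could resolve the stale-attempt case is sketched below. This is an illustration under stated assumptions, not the actual design: the `CollectorReport` type, its fields, and the rule "higher attempt ID wins, then later timestamp wins" are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in YARN.
class RmCollectorRebuildSketch {
  // One entry of the mapping an NM sends on resync (assumed shape).
  static final class CollectorReport {
    final String appId;
    final int attemptId;
    final long timestamp;
    final String address;

    CollectorReport(String appId, int attemptId, long timestamp, String address) {
      this.appId = appId;
      this.attemptId = attemptId;
      this.timestamp = timestamp;
      this.address = address;
    }
  }

  // Assumed rule: a higher attempt ID is newer; the timestamp breaks ties.
  static boolean isNewer(CollectorReport candidate, CollectorReport current) {
    if (candidate.attemptId != current.attemptId) {
      return candidate.attemptId > current.attemptId;
    }
    return candidate.timestamp > current.timestamp;
  }

  // Rebuild the app ID -> collector address table from all NM reports,
  // keeping only the collector of the latest attempt for each application.
  static Map<String, String> rebuild(List<CollectorReport> reports) {
    Map<String, CollectorReport> latest = new HashMap<>();
    for (CollectorReport report : reports) {
      CollectorReport current = latest.get(report.appId);
      if (current == null || isNewer(report, current)) {
        latest.put(report.appId, report);
      }
    }
    Map<String, String> table = new HashMap<>();
    for (CollectorReport report : latest.values()) {
      table.put(report.appId, report.address);
    }
    return table;
  }
}
```

With only an app ID as the key, the two reports for the same application would be indistinguishable and the result would depend on arrival order; carrying the attempt ID (and timestamp) makes the choice deterministic.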

Not sure if we missed some critical points in this design. Thoughts? 

> Recover collector list in RM failed over
> ----------------------------------------
>                 Key: YARN-3359
>                 URL: https://issues.apache.org/jira/browse/YARN-3359
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Li Lu
>              Labels: YARN-5355
> Per discussion in YARN-3039, split the recover work from RMStateStore in a separated
