hadoop-yarn-issues mailing list archives

From "Anubhav Dhoot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1366) ApplicationMasterService should Resync with the AM upon allocate call after restart
Date Fri, 16 May 2014 11:08:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999451#comment-13999451 ]

Anubhav Dhoot commented on YARN-1366:
-------------------------------------

It seems we are going with no resync API for now, as per the current patch. I think it's a
good idea to hold off on the new API unless we see a need. I feel there isn't a strong case
for it yet.

There are a few issues I see that will need a little more work.
Pending releases - The AM forgets about a release request once it is made. To be safe, we will
have to reissue the release request after an RM restart (and also make sure the RM can handle a
duplicate of it). Otherwise we have a resource leak if the RM had not processed the release
before it restarted. One way is to remember all releases in a new Set<ContainerId> pendingReleases
in RMContainerRequestor and remove entries by processing getCompletedContainersStatuses in
makeRemoteRequest or in a new function that it calls.
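
To illustrate the pendingReleases idea, here is a rough sketch only. It assumes plain String ids
in place of ContainerId, and the class and method names (PendingReleaseTracker, release,
onCompletedContainers, releasesToReissue) are hypothetical; they are not part of the patch or of
RMContainerRequestor.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical helper, not actual YARN code. Container ids are plain
 * Strings here as a stand-in for ContainerId.
 */
class PendingReleaseTracker {
    // Releases that were requested but not yet confirmed as completed.
    private final Set<String> pendingReleases = new HashSet<>();

    /** Remember a release request so it survives an RM restart on the AM side. */
    void release(String containerId) {
        pendingReleases.add(containerId);
    }

    /** Forget releases for containers that have been reported as completed. */
    void onCompletedContainers(Collection<String> completedIds) {
        pendingReleases.removeAll(completedIds);
    }

    /** Releases that should be sent again after a resync. */
    Set<String> releasesToReissue() {
        return new HashSet<>(pendingReleases);
    }
}
```

Because a reissued release may reach the RM twice, this only works if the RM treats a duplicate
release as a no-op.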

{code}
+    blacklistAdditions.addAll(blacklistedNodes);
{code}
Blacklisting has logic in ignoreBlacklisting to ignore it if we cross a threshold. So we can
do

{code}
if (!ignoreBlacklisting.get()) {
   blacklistAdditions.addAll(blacklistedNodes);
}
{code}
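
A self-contained sketch of that guard, to show the intended behavior on resync. The class name
BlacklistResync and the method prepareResync are hypothetical, node names are plain Strings, and
ignoreBlacklisting is assumed to be an AtomicBoolean (the snippet above only shows that it has
a get() method).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the relevant requestor state, not actual YARN code. */
class BlacklistResync {
    // Set to true once blacklisting is being ignored (threshold crossed).
    final AtomicBoolean ignoreBlacklisting = new AtomicBoolean(false);
    // Nodes the AM currently considers blacklisted.
    final Set<String> blacklistedNodes = new HashSet<>();
    // Blacklist additions to send with the next allocate request.
    final Set<String> blacklistAdditions = new HashSet<>();

    /** Re-send the blacklist on resync only while blacklisting is in effect. */
    void prepareResync() {
        if (!ignoreBlacklisting.get()) {
            blacklistAdditions.addAll(blacklistedNodes);
        }
    }
}
```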

There are a few places where a line exceeds 80 chars.
Otherwise it looks good.
Let's add some tests and validate this.




 

> ApplicationMasterService should Resync with the AM upon allocate call after restart
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-1366
>                 URL: https://issues.apache.org/jira/browse/YARN-1366
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Rohith
>         Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, YARN-1366.prototype.patch,
>                      YARN-1366.prototype.patch
>
>
> The ApplicationMasterService currently sends a resync response to which the AM responds
> by shutting down. The AM behavior is expected to change to resyncing with the RM.
> Resync means resetting the allocate RPC sequence number to 0, and the AM should send its entire
> outstanding request to the RM. Note that if the AM is making its first allocate call to the
> RM then things should proceed like normal without needing a resync. The RM will return all
> containers that have completed since the RM last synced with the AM. Some container completions
> may be reported more than once.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
