hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3730) Allow restarted NM to rejoin cluster before RM expires it
Date Fri, 27 Jan 2012 16:40:10 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194902#comment-13194902 ]

Jason Lowe commented on MAPREDUCE-3730:

bq. Also, instead of using a new RMNodeImpl, you can simply send a RECONNECTED event to the
existing RMNodeImpl.

There are a few reasons I thought using a new RMNodeImpl would be better:

1) The new node could have different capabilities than the previous node (e.g., more RAM).
2) Reusing the node means all states have to handle the RECONNECTED event.  It seemed simpler
from a maintenance standpoint to treat the RECONNECTED event as an accelerated EXPIRE/STARTED
transition since that's effectively what occurred.  The RM just didn't get a chance to notice
the EXPIRE event because the timeout was too large in that instance.
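
To make the two points above concrete, here is a minimal Java sketch of the "accelerated EXPIRE/STARTED" idea. Every name in it (NodeSketch, NodeRegistrySketch, register) is hypothetical; it is not the RMNodeImpl or resource tracker code from the patch. It only illustrates retiring the old node record and creating a fresh one on re-registration, so that a changed capability is picked up and no existing state needs a RECONNECTED handler.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-node record; not the real RMNodeImpl.
final class NodeSketch {
    enum State { RUNNING, UNHEALTHY, LOST }

    final String nodeId;          // host:port the NM registered with
    final int memoryMb;           // capability reported at registration
    State state = State.RUNNING;  // a freshly registered node starts out RUNNING

    NodeSketch(String nodeId, int memoryMb) {
        this.nodeId = nodeId;
        this.memoryMb = memoryMb;
    }
}

// Hypothetical registry keyed by host:port.
final class NodeRegistrySketch {
    private final Map<String, NodeSketch> nodes = new HashMap<>();

    // If the same host:port is already tracked, retire the old record as if
    // it had expired, then track a brand-new record in its place.
    NodeSketch register(String nodeId, int memoryMb) {
        NodeSketch previous = nodes.get(nodeId);
        if (previous != null) {
            previous.state = NodeSketch.State.LOST;          // accelerated EXPIRE
        }
        NodeSketch fresh = new NodeSketch(nodeId, memoryMb); // accelerated STARTED
        nodes.put(nodeId, fresh);
        return fresh;
    }

    NodeSketch get(String nodeId) {
        return nodes.get(nodeId);
    }
}
```

Because the old record is simply dropped, the per-state transition table of the node object stays unchanged in this sketch; only the registration path knows about reconnects.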

bq.  I think we should also not process the health status during registration. That will happen
anyways in the next status update, right?

Following the "accelerated EXPIRE/STARTED" logic, it seemed better to assume a reconnected
node is just like a newly connected node (i.e., we assume healthy until proven otherwise).
As you say, any incorrect guess there will be corrected on the next status update.
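
As a rough illustration of "healthy until proven otherwise", the following Java sketch uses a hypothetical NodeHealthSketch class (not part of Hadoop) whose health flag defaults to true at registration and is overwritten by whatever the next status update reports.

```java
// Hypothetical helper; the real RM tracks health inside its node state machine.
final class NodeHealthSketch {
    // Optimistic default applied at (re)registration.
    private boolean healthy = true;

    // Called for each status update; the reported value replaces the guess.
    void onStatusUpdate(boolean reportedHealthy) {
        healthy = reportedHealthy;
    }

    boolean isHealthy() {
        return healthy;
    }
}
```

Under this assumption, a wrong guess lasts at most one status update interval.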

> Allow restarted NM to rejoin cluster before RM expires it
> ---------------------------------------------------------
>                 Key: MAPREDUCE-3730
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3730
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2, resourcemanager
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: MAPREDUCE-3730.patch, MAPREDUCE-3730.patch
> When a node in the RUNNING state (healthy or unhealthy) is rebooted, the resourcemanager
> rejects the nodemanager's registration request as a duplicate because it is convinced that
> the nodemanager is already running on that node.  It won't allow that node to rejoin the cluster
> until the node expiration time elapses, which is 10min+ by default.  We should allow the NM
> to rejoin the cluster if it re-registers within the expiration timeout.
> Note that this problem occurs with NMs that are configured with specific ports.  If ephemeral
> ports are used then an NM reboot "works" because the RM thinks the NM registration is for a
> new node.  See the discussions in MAPREDUCE-3070 and MAPREDUCE-3363.


