hadoop-mapreduce-issues mailing list archives

From "Eric Payne (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3034) NM should act on a REBOOT command from RM
Date Fri, 27 Jan 2012 15:48:10 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194862#comment-13194862

Eric Payne commented on MAPREDUCE-3034:


That's fine if you want to take it over. When do you think you can get a patch up? I was hoping
to get this going within the next week.

From my point of view, the basic requirement is to be able to bounce the RM without having
to manually start every single NM in a very large cluster (thousands of NMs).

Right now, when the NM gets the reboot command from the RM, it just calls the stop hooks,
exactly as it does for a shutdown command. My plan is that when the NM gets a reboot command,
it still executes the shutdown hook, but then runs a new reboot hook that executes the same
basic startup code that originally ran in NodeManager.main(). Is that your basic plan?
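
To make the plan above concrete, here is a minimal sketch of the stop-then-restart
flow. It is not the proof-of-concept patch and does not use the real Hadoop classes:
RebootSketch, Action, initAndStart(), stop() and handle() are all hypothetical names
standing in for the NM, the RM's command, and the startup and shutdown paths.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the NM; none of these names are real Hadoop APIs.
class RebootSketch {

    // Commands the RM might send back to the NM (names are assumptions).
    enum Action { NORMAL, SHUTDOWN, REBOOT }

    // Records lifecycle events so the order of calls can be inspected.
    final List<String> events = new ArrayList<>();
    private boolean running = false;

    // Startup path: stands in for the code that main() runs at first launch.
    void initAndStart() {
        running = true;
        events.add("start");
    }

    // Shutdown path: stands in for the existing stop hooks.
    void stop() {
        if (running) {
            running = false;
            events.add("stop");
        }
    }

    boolean isRunning() {
        return running;
    }

    // Dispatch on the command received from the RM.
    void handle(Action action) {
        switch (action) {
            case SHUTDOWN:
                stop();
                break;
            case REBOOT:
                stop();          // same shutdown hook as SHUTDOWN
                initAndStart();  // reboot hook: re-run the startup path
                break;
            default:
                break;           // NORMAL: nothing to do
        }
    }

    public static void main(String[] args) {
        RebootSketch nm = new RebootSketch();
        nm.initAndStart();
        nm.handle(Action.REBOOT);
        // prints "[start, stop, start] running=true"
        System.out.println(nm.events + " running=" + nm.isRunning());
    }
}
```

The point of the sketch is only that REBOOT reuses the shutdown path unchanged and then
re-enters the startup path in the same process, so nobody has to start the NM by hand.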

I have already written a proof-of-concept patch and tested it on a 10-node secure cluster.
To test it, I shut down the RM and restarted it. After the restart, I ran an hour's worth of jobs
and compared the time and heap size from before and after. They all looked good to me.

> NM should act on a REBOOT command from RM
> -----------------------------------------
>                 Key: MAPREDUCE-3034
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3034
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Devaraj K
>         Attachments: MR-3034.txt
> RM sends a reboot command to NM in some cases, like when it gets lost and rejoins.
> In such a case, NM should act on the command and reboot/reinitialize itself.
> This is akin to TT reinitialize on order from JT. We will need to shut down all the services
> properly and reinitialize - this should automatically take care of killing of containers,
> cleaning up local temporary files etc.


