hadoop-mapreduce-issues mailing list archives

From "Eric Payne (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3034) NM should act on a REBOOT command from RM
Date Fri, 27 Jan 2012 15:48:10 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194862#comment-13194862
] 

Eric Payne commented on MAPREDUCE-3034:
---------------------------------------

@Devaraj

That's fine if you want to take it over. When do you think you can get a patch up? I was hoping
to get this going within the next week.

From my point of view, the basic requirement is to be able to bounce the RM without having
to manually restart every single NM in a very large cluster (thousands of NMs).

Right now, when the NM gets the reboot command from the RM, it just calls the stop hooks, just
as it does when it gets a shutdown command. My plan is that when the NM gets a reboot command, it
still executes the shutdown hook, but then runs a new reboot hook that executes the same basic
code that was run at startup in NodeManager.main(). Is that your basic plan?
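
To make the flow above concrete, here is a rough, self-contained sketch. It is not the proof-of-concept patch and does not use the real NodeManager classes: NodeAction, SketchNodeManager, initAndStart(), and handle() are hypothetical names chosen only for illustration, and the "services" are reduced to a list of recorded events.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these names come from the Hadoop code base.
enum NodeAction { NORMAL, SHUTDOWN, REBOOT }

class SketchNodeManager {
    // Records lifecycle events so the flow can be inspected.
    final List<String> events = new ArrayList<>();
    private boolean running = false;

    // The same start-up path that a main() method would run on first launch.
    void initAndStart() {
        events.add("init");
        events.add("start");
        running = true;
    }

    // Stop hook: shared by SHUTDOWN and REBOOT.
    void stop() {
        if (running) {
            events.add("stop");
            running = false;
        }
    }

    boolean isRunning() {
        return running;
    }

    // React to the action carried in a (hypothetical) heartbeat response.
    void handle(NodeAction action) {
        switch (action) {
            case SHUTDOWN:
                stop();
                break;
            case REBOOT:
                stop();          // still run the shutdown hook first
                initAndStart();  // then re-run the start-up path
                break;
            default:
                break;           // NORMAL: nothing to do
        }
    }
}
```

In this sketch, calling initAndStart() and then handle(NodeAction.REBOOT) leaves the node running again, whereas handle(NodeAction.SHUTDOWN) leaves it stopped. The only difference between the two commands is the extra start-up step after the shared stop hook.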

I have already written up a "proof-of-concept" patch and tested it in a 10-node secure cluster.
To test it, I shut down the RM and restarted it. After the restart, I ran an hour's worth of jobs
and compared the time and heap size from before and after. They all looked good to me.

Thanks,
-Eric
                
> NM should act on a REBOOT command from RM
> -----------------------------------------
>
>                 Key: MAPREDUCE-3034
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3034
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Devaraj K
>         Attachments: MR-3034.txt
>
>
> RM sends a reboot command to NM in some cases, like when it gets lost and rejoins back.
> In such a case, NM should act on the command and reboot/reinitialize itself.
> This is akin to TT reinitialize on order from JT. We will need to shut down all the services
> properly and reinitialize - this should automatically take care of killing of containers,
> cleaning up local temporary files etc.


        
