hadoop-mapreduce-issues mailing list archives

From "Eric Payne (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3034) NM should act on a REBOOT command from RM
Date Wed, 08 Feb 2012 21:21:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204001#comment-13204001 ]

Eric Payne commented on MAPREDUCE-3034:


I'm pretty sure that the NodeStatusUpdaterImpl.stop() hierarchy already stops the AppMaster
and Containers on the NM via the AsyncDispatcher event process. I was able to verify this
by examining the code, running tests, and examining the logs.

# Verified by examining the code:
** When the reboot command comes from the RM to the NM, NodeStatusUpdaterImpl.reboot() sets
the isRebooted flag and calls NodeStatusUpdaterImpl.stop().
** NodeStatusUpdaterImpl.stop() (eventually) calls both AbstractService.changeState() and
CompositeService.stop(int numOfServicesStarted). These methods loop through the list of services
registered with them and stop each one.
# Verified by running tests:
** With this change implemented, I started a long-running mapred job and then stopped and
restarted the RM.
** During the interval between stopping and restarting the RM, I took a snapshot of the java
processes running.
** Also, during the interval between stopping and restarting the RM, I searched the NM and
container logs for messages from the AsyncDispatcher to determine if any services were stopped.
None were.
** After restarting the RM, I took another snapshot of the java processes. Comparing the
two snapshots showed that before the RM was restarted, the long-running mapred job was still
running, with its MRAppMaster and its container (a YarnChild process) both alive. After the
RM started again, the MRAppMaster and YarnChild processes were gone.
# Verified by examining logs:
** After running the above test, I did another search through the NM and container logs and
found several services that had been stopped via the AsyncDispatcher event process. Specifically
of interest, the ones from the container {{syslog}} file were these:
*** JobHistoryEventHandler
*** ContainerLauncherImpl
*** MRAppMaster$ContainerLauncherRouter
*** RMCommunicator
*** MRAppMaster$ContainerAllocatorRouter
*** MRClientService
*** TaskCleaner
*** TaskHeartbeatHandler
*** TaskAttemptListenerImpl
*** Dispatcher
*** MRAppMaster
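
The code path described in the first item above (reboot sets a flag, then stop cascades through a list of registered services) can be illustrated with a minimal sketch. This is not the Hadoop source: {{RebootSketch}}, {{CompositeSketch}}, {{StatusUpdaterSketch}} and the service names are hypothetical stand-ins, the route from the updater's stop to the composite stop is collapsed into one direct call, and stopping in reverse registration order is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class RebootSketch {

    /** Minimal stand-in for a stoppable service (hypothetical, not Hadoop's Service). */
    interface Service {
        void stop();
    }

    /** Holds child services and stops each one when it is stopped itself. */
    static class CompositeSketch implements Service {
        private final List<Service> services = new ArrayList<>();
        private boolean stopped = false;

        void addService(Service service) {
            services.add(service);
        }

        @Override
        public void stop() {
            if (stopped) {
                return; // already stopped; a second call is a no-op
            }
            stopped = true;
            // Assumption of this sketch: stop in reverse registration order.
            for (int i = services.size() - 1; i >= 0; i--) {
                try {
                    services.get(i).stop();
                } catch (RuntimeException e) {
                    // One failing child should not prevent the rest from stopping.
                }
            }
        }
    }

    /** Stand-in for the status updater: reboot() sets a flag, then triggers stop. */
    static class StatusUpdaterSketch {
        private final Service parent;
        boolean isRebooted = false;

        StatusUpdaterSketch(Service parent) {
            this.parent = parent;
        }

        void reboot() {
            isRebooted = true;
            parent.stop(); // simplification: the real call chain is indirect
        }
    }

    /** Registers three named services, triggers a reboot twice, returns the stop order. */
    static List<String> demo() {
        List<String> stopOrder = new ArrayList<>();
        CompositeSketch composite = new CompositeSketch();
        for (String name : new String[] {"service-a", "service-b", "service-c"}) {
            composite.addService(() -> stopOrder.add(name));
        }
        StatusUpdaterSketch updater = new StatusUpdaterSketch(composite);
        updater.reboot();
        updater.reboot(); // second reboot must not stop the services again
        return stopOrder;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the sketch each registered service is stopped exactly once even if the reboot command arrives twice, which matches the expectation that a repeated stop on an already-stopped service does nothing.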

> NM should act on a REBOOT command from RM
> -----------------------------------------
>                 Key: MAPREDUCE-3034
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3034
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Devaraj K
>            Priority: Critical
>         Attachments: MAPREDUCE-3034-1.patch, MAPREDUCE-3034-2.patch, MAPREDUCE-3034-3.patch,
> MAPREDUCE-3034.patch, MR-3034.txt
>
> RM sends a reboot command to NM in some cases, like when it gets lost and rejoins.
> In such a case, NM should act on the command and reboot/reinitialize itself.
> This is akin to TT reinitialize on order from JT. We will need to shut down all the services
> properly and reinitialize - this should automatically take care of killing containers,
> cleaning up local temporary files, etc.
