hadoop-mapreduce-issues mailing list archives

From "Eric Payne (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3034) NM should act on a REBOOT command from RM
Date Wed, 08 Feb 2012 21:21:00 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204001#comment-13204001 ]

Eric Payne commented on MAPREDUCE-3034:
---------------------------------------

@Arun,

I'm pretty sure that the NodeStatusUpdaterImpl.stop() hierarchy already stops the AppMaster
and Containers on the NM via the AsyncDispatcher event process. I was able to verify this
by examining the code, running tests, and examining the logs.

# Verified by examining the code:
** When the reboot command comes from the RM to the NM, NodeStatusUpdaterImpl.reboot() sets
the isRebooted flag and calls NodeStatusUpdaterImpl.stop().
** NodeStatusUpdaterImpl.stop() (eventually) calls both AbstractService.changeState() and
CompositeService.stop(int numOfServicesStarted). These methods loop through the list of services
registered with them and stop each one.
# Verified by running tests:
** With this change implemented, I started a long-running mapred job and then stopped and
restarted the RM.
** During the interval between stopping and restarting the RM, I took a snapshot of the java
processes running.
** Also, during the interval between stopping and restarting the RM, I searched the NM and
container logs for messages from the AsyncDispatcher to determine if any services were stopped.
None were.
** After restarting the RM, I took another snapshot of the java processes. Comparing the
two snapshots showed that before the RM was restarted, the long-running mapred job was still
running, with its MRAppMaster and a YarnChild container process present. After the RM started
again, the MRAppMaster and YarnChild processes were gone.
# Verified by examining logs:
** After running the above test, I searched the NM and container logs again and found several
services that had been stopped via the AsyncDispatcher event process. Of particular interest
are the ones from the container {{syslog}} file:
*** JobHistoryEventHandler
*** ContainerLauncherImpl
*** MRAppMaster$ContainerLauncherRouter
*** RMCommunicator
*** MRAppMaster$ContainerAllocatorRouter
*** MRClientService
*** TaskCleaner
*** TaskHeartbeatHandler 
*** TaskAttemptListenerImpl
*** Dispatcher
*** MRAppMaster
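
The stop path described under "Verified by examining the code" can be sketched roughly as follows. This is a minimal illustration and not the Hadoop source: the types SketchService, LoggingService, SketchCompositeService, SketchNodeStatusUpdater and RebootSketch, and the child names "child-A" and "child-B", are all hypothetical. It also assumes that children are stopped in reverse registration order, which the comment above does not state.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a stoppable service; not the Hadoop Service interface. */
interface SketchService {
    String name();
    void stop();
}

/** Leaf service that records its own stop in a shared log so the order can be inspected. */
class LoggingService implements SketchService {
    private final String name;
    private final List<String> stopLog;

    LoggingService(String name, List<String> stopLog) {
        this.name = name;
        this.stopLog = stopLog;
    }

    public String name() { return name; }

    public void stop() { stopLog.add(name); }
}

/** Parent service that stops every registered child, then records its own stop. */
class SketchCompositeService implements SketchService {
    private final String name;
    private final List<String> stopLog;
    private final List<SketchService> children = new ArrayList<>();

    SketchCompositeService(String name, List<String> stopLog) {
        this.name = name;
        this.stopLog = stopLog;
    }

    void addService(SketchService child) { children.add(child); }

    public String name() { return name; }

    public void stop() {
        // Assumption: the most recently registered child is stopped first.
        for (int i = children.size() - 1; i >= 0; i--) {
            children.get(i).stop();
        }
        stopLog.add(name);
    }
}

/** Mirrors the described reboot(): set the flag, then go through the normal stop path. */
class SketchNodeStatusUpdater extends SketchCompositeService {
    private boolean isRebooted = false;

    SketchNodeStatusUpdater(String name, List<String> stopLog) {
        super(name, stopLog);
    }

    void reboot() {
        isRebooted = true;
        stop();
    }

    boolean isRebooted() { return isRebooted; }
}

class RebootSketch {
    /** Builds a two-child updater, triggers reboot(), and returns the recorded stop order. */
    static List<String> run() {
        List<String> stopLog = new ArrayList<>();
        SketchNodeStatusUpdater updater = new SketchNodeStatusUpdater("NodeStatusUpdater", stopLog);
        updater.addService(new LoggingService("child-A", stopLog));
        updater.addService(new LoggingService("child-B", stopLog));
        updater.reboot();
        return stopLog;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is only that reboot() needs no container-specific logic of its own: once it calls stop(), every registered child is stopped by the same loop, which matches the list of stopped services found in the logs.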

                
> NM should act on a REBOOT command from RM
> -----------------------------------------
>
>                 Key: MAPREDUCE-3034
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3034
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Devaraj K
>            Priority: Critical
>         Attachments: MAPREDUCE-3034-1.patch, MAPREDUCE-3034-2.patch, MAPREDUCE-3034-3.patch, MAPREDUCE-3034.patch, MR-3034.txt
>
>
> RM sends a reboot command to NM in some cases, like when it gets lost and rejoins.
> In such a case, NM should act on the command and reboot/reinitialize itself.
> This is akin to TT reinitialize on order from JT. We will need to shut down all the services
> properly and reinitialize - this should automatically take care of killing containers,
> cleaning up local temporary files etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
