hadoop-mapreduce-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5441) JobClient exit whenever RM issue Reboot command to 1st attempt App Master.
Date Wed, 28 Aug 2013 00:54:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751973#comment-13751973 ]

Jian He commented on MAPREDUCE-5441:

Thanks [~rohithsharma] for reporting this problem.

Earlier, this problem was not easily reproduced on my side because at that time MR chose to
ignore the Invalid AMRMToken exception after an RM restart: it never explicitly sent the
JOB_AM_REBOOT event and stayed alive until it was killed by a signal from the NM. After that,
JobClient could quickly switch to the new AM.

MR has since been changed to explicitly send the JOB_AM_REBOOT event on an Invalid AMRMToken
exception (this should be fixed later), so JobClient is more likely to see the ERROR state of
the job, which leads JobClient to exit prematurely.
I reproduced the problem by putting a long sleep in MRAppMaster.shutDownJob() for the normal
shutdown path and in MRAppMasterShutdownHook for the signal-triggered shutdown path, so that
JobClient is very likely to see the ERROR state.

I uploaded a patch that, when the job is in the REBOOT state, returns the external state as
RUNNING to prevent JobClient from exiting prematurely.
The manual test above passed with the patch and failed without it.
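
To make the idea behind the patch concrete, here is a minimal sketch of mapping an internal
REBOOT state to an external RUNNING state. It is not the patch itself: the class, enum, and
method names are hypothetical and only the REBOOT, RUNNING, and ERROR states come from the
discussion above.

```java
// Illustrative sketch only. The names below are assumptions made for this
// example and are not taken from the Hadoop source or from the patch.
final class ExternalStateSketch {

    /** States the AM tracks internally (hypothetical subset). */
    enum InternalState { RUNNING, REBOOT, SUCCEEDED, ERROR }

    /** States reported to clients (hypothetical subset). */
    enum ExternalState { RUNNING, SUCCEEDED, ERROR }

    /** Maps an internal state to the state a client is allowed to see. */
    static ExternalState toExternal(InternalState internal) {
        switch (internal) {
            case REBOOT:
                // Another AM attempt is expected to follow, so from the
                // client's point of view the job is still in progress.
                return ExternalState.RUNNING;
            case SUCCEEDED:
                return ExternalState.SUCCEEDED;
            case ERROR:
                return ExternalState.ERROR;
            case RUNNING:
            default:
                return ExternalState.RUNNING;
        }
    }

    /** A polling client stops as soon as it sees a non-running state. */
    static boolean clientShouldExit(ExternalState state) {
        return state != ExternalState.RUNNING;
    }

    public static void main(String[] args) {
        ExternalState seen = toExternal(InternalState.REBOOT);
        System.out.println(seen + " -> client exits? " + clientShouldExit(seen));
    }
}
```

With this mapping, a client that polls during the reboot window keeps waiting instead of
treating the job as failed, and can pick up the status from the next AM attempt.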
> JobClient exit whenever RM issue Reboot command to 1st attempt App Master.
> --------------------------------------------------------------------------
>                 Key: MAPREDUCE-5441
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5441
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, client
>    Affects Versions: 2.1.0-beta, 2.0.5-alpha, 2.1.1-beta
>            Reporter: Rohith Sharma K S
>            Assignee: Jian He
>         Attachments: MAPREDUCE-5441.patch
> When the RM issues a Reboot command to the app master, the app master shuts down gracefully. All the history
> events are written to HDFS with the job status set to ERROR. JobClient gets the job state as ERROR
> and exits.
> But the RM launches a 2nd-attempt app master, and no client is there to get the job status. In
> the RM UI, the job status is displayed as SUCCESS, but for the client the job has failed.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
