hadoop-mapreduce-issues mailing list archives

From "Eric Payne (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3186) User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
Date Thu, 27 Oct 2011 00:35:32 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136641#comment-13136641 ]

Eric Payne commented on MAPREDUCE-3186:
---------------------------------------

Problems being solved and their solutions:

# +When an application is running and the RM goes down, the MRAppMaster loops forever.+
Changes were made to {{RMContainerAllocator::getResources()}} so that it attempts to contact
the RM only a limited number of times. The retry limit is taken from {{MRJobConfig.MR_AM_TO_RM_RETRIES}},
whose property name is {{yarn.app.mapreduce.am.scheduler.connection.retries}}.
??This is a new yarn config property??.
If contact with the RM fails the specified number of times, {{RMContainerAllocator::getResources()}}
generates an INTERNAL_ERROR event and throws a YarnException, which is caught
by {{RMCommunicator::AllocatorThread}} and causes that thread to exit.
# +When the RM is stopped and restarted, the MRAppMaster does not honor the "shouldreboot" flag sent from the RM and keeps attempting to connect with the new RM.+
Changes were made to {{RMContainerAllocator::getResources()}} to check the reboot flag in
the response from the call to {{makeRemoteRequest()}}. If the reboot flag is set, {{RMContainerAllocator::getResources()}}
generates an INTERNAL_ERROR event and throws a YarnException, which is caught by {{RMCommunicator::AllocatorThread}}
and causes that thread to exit.
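
The control flow described in the two items above can be sketched roughly as follows. This is a standalone illustration, not the code from the patch: the class {{AllocatorRetrySketch}}, the {{Response}} and {{RemoteCall}} types, and the use of {{IllegalStateException}} in place of YarnException are all hypothetical stand-ins, and resetting the failure counter after a successful call is an assumption that the comment does not confirm.

```java
import java.io.IOException;

// Hypothetical sketch of the retry and reboot handling; none of these
// names come from the Hadoop code base.
class AllocatorRetrySketch {

    /** Stand-in for the RM response; only the reboot flag is modeled. */
    static final class Response {
        final boolean reboot;
        Response(boolean reboot) { this.reboot = reboot; }
    }

    /** Stand-in for makeRemoteRequest(): one heartbeat to the RM. */
    interface RemoteCall {
        Response call() throws IOException;
    }

    // In the real AM this limit would be read from
    // yarn.app.mapreduce.am.scheduler.connection.retries.
    private final int maxRetries;
    private int failures = 0;
    private boolean internalErrorRaised = false;

    AllocatorRetrySketch(int maxRetries) { this.maxRetries = maxRetries; }

    boolean internalErrorRaised() { return internalErrorRaised; }

    /**
     * One allocation round. Returns null if the RM was unreachable but the
     * retry budget is not yet spent; throws when the calling thread should exit.
     */
    Response getResources(RemoteCall rm) {
        Response response;
        try {
            response = rm.call();
            failures = 0; // assumption: a successful contact resets the count
        } catch (IOException e) {
            failures++;
            if (failures >= maxRetries) {
                internalErrorRaised = true; // models the INTERNAL_ERROR event
                throw new IllegalStateException(
                    "Could not contact RM after " + maxRetries + " tries", e);
            }
            return null; // try again on the next heartbeat
        }
        if (response.reboot) {
            internalErrorRaised = true; // models the INTERNAL_ERROR event
            throw new IllegalStateException("RM requested a reboot");
        }
        return response;
    }

    public static void main(String[] args) {
        AllocatorRetrySketch allocator = new AllocatorRetrySketch(3);
        RemoteCall rmDown = () -> { throw new IOException("connection refused"); };
        try {
            while (true) {
                allocator.getResources(rmDown); // loop ends once retries run out
            }
        } catch (IllegalStateException e) {
            System.out.println("Allocator thread exits: " + e.getMessage());
        }
    }
}
```

In this sketch both failure paths end the same way: an error flag is raised and an exception propagates to the caller, which mirrors the description of {{RMCommunicator::AllocatorThread}} catching the exception and exiting instead of looping forever.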

                
> User jobs are getting hanged if the Resource manager process goes down and comes up while
job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: linux
>            Reporter: Ramgopal N
>            Assignee: Eric Payne
>            Priority: Blocker
>              Labels: test
>
> If the resource manager is restarted while the job execution is in progress, the job
is getting hanged.
> UI shows the job as running.
> In the RM log, it is throwing an error "ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> In the console MRAppMaster and Runjar processes are not getting killed

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
