hadoop-mapreduce-issues mailing list archives

From "Mahadev konar (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-3186) User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
Date Thu, 27 Oct 2011 02:27:35 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-3186:
-------------------------------------

    Status: Open  (was: Patch Available)

@Eric,
 Looked at the patch, a minor comment:

{noformat}
 retrycount = getConfig().getInt(MRJobConfig.MR_AM_TO_RM_RETRIES,
+                                       MRJobConfig.DEFAULT_MR_AM_TO_RM_RETRIES);
{noformat}

We should probably avoid reading the config entry every time we call get resources. The max
retry count can be initialized once in the init() call.
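
A minimal sketch of the idea, not the actual patch: the class below is a hypothetical stand-in that uses java.util.Properties in place of the Hadoop configuration, and the key string, default value, and configReads counter are assumptions made only for illustration. It reads the retry limit once in init() and reuses the cached field on every later call.

```java
import java.util.Properties;

// Hypothetical stand-in for the component in the patch; not the real Hadoop class.
class RetryConfigSketch {
    // Assumed key name and default value, chosen only for this illustration.
    static final String MR_AM_TO_RM_RETRIES = "example.am.to.rm.retries";
    static final int DEFAULT_MR_AM_TO_RM_RETRIES = 3;

    private final Properties conf;
    private int maxRetries;   // cached once by init()
    int configReads = 0;      // counts config lookups, for demonstration only

    RetryConfigSketch(Properties conf) {
        this.conf = conf;
    }

    // Read the retry limit a single time, when the component is initialized.
    void init() {
        String raw = conf.getProperty(MR_AM_TO_RM_RETRIES,
                                      String.valueOf(DEFAULT_MR_AM_TO_RM_RETRIES));
        maxRetries = Integer.parseInt(raw);
        configReads++;
    }

    // Called on every heartbeat; uses the cached field instead of the config.
    int getResources() {
        return maxRetries;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MR_AM_TO_RM_RETRIES, "5");
        RetryConfigSketch sketch = new RetryConfigSketch(conf);
        sketch.init();
        for (int i = 0; i < 10; i++) {
            sketch.getResources();
        }
        System.out.println(sketch.getResources() + " retries, "
                + sketch.configReads + " config read(s)");
    }
}
```

The trade-off is that a config change made after init() is not picked up, which seems acceptable for a retry limit that is fixed for the lifetime of the job.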

Regarding the test, you should be able to mock a failure to communicate with the RM and make
sure that an internal error is generated. Also check that the MRApp shuts down on an internal error.
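
The shape of such a test could look roughly like the sketch below. Everything in it is hypothetical: RmClient, Allocator, and the two boolean flags are stand-ins invented for illustration, not classes from the patch or from Hadoop, and the rule of giving up once the failure count reaches the limit is an assumption. A hand-written fake that always throws plays the role of the mocked RM.

```java
import java.io.IOException;

class RmFailureSketch {
    // Hypothetical stand-in for the AM-to-RM call that the test would mock.
    interface RmClient {
        void allocate() throws IOException;
    }

    // Hypothetical stand-in for the AM-side allocator loop.
    static class Allocator {
        private final RmClient rm;
        private final int maxRetries;
        private int consecutiveFailures = 0;
        boolean internalErrorRaised = false;
        boolean shutDown = false;

        Allocator(RmClient rm, int maxRetries) {
            this.rm = rm;
            this.maxRetries = maxRetries;
        }

        // One heartbeat: contact the RM and track consecutive failures.
        void heartbeat() {
            if (shutDown) {
                return;
            }
            try {
                rm.allocate();
                consecutiveFailures = 0;
            } catch (IOException e) {
                consecutiveFailures++;
                // Assumed policy: give up once the limit is reached.
                if (consecutiveFailures >= maxRetries) {
                    internalErrorRaised = true;
                    shutDown = true;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Fake RM that is always unreachable.
        RmClient alwaysFails = () -> {
            throw new IOException("RM unreachable (simulated)");
        };
        Allocator allocator = new Allocator(alwaysFails, 3);
        for (int i = 0; i < 3; i++) {
            allocator.heartbeat();
        }
        if (!allocator.internalErrorRaised || !allocator.shutDown) {
            throw new AssertionError("expected internal error and shutdown");
        }
        System.out.println("internal error raised and app shut down");
    }
}
```

In the real test the fake would presumably be replaced by a mock of whatever interface the AM uses to talk to the RM, with the two assertions checking the internal error event and the MRApp shutdown.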
                
> User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: linux
>            Reporter: Ramgopal N
>            Assignee: Eric Payne
>            Priority: Blocker
>              Labels: test
>         Attachments: MAPREDUCE-3186.v1.txt
>
>
> If the resource manager is restarted while the job execution is in progress, the job hangs.
> The UI shows the job as running.
> The RM log shows the error "ERROR org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> In the console, the MRAppMaster and RunJar processes are not killed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
