hadoop-mapreduce-issues mailing list archives

From "Mahadev konar (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-3233) AM fails to restart when first AM is killed
Date Sat, 22 Oct 2011 02:20:32 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated MAPREDUCE-3233:
-------------------------------------

    Status: Patch Available  (was: Open)

Verified the patch on a secure cluster: killed the AM, it came back up and started running the job
again. There is an issue with continuous logging on the client side (on AM restart) that we need
to get rid of. I'll open a separate JIRA for that.
                
> AM fails to restart when first AM is killed
> -------------------------------------------
>
>                 Key: MAPREDUCE-3233
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3233
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Karam Singh
>            Assignee: Mahadev konar
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3233.patch
>
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started the YARN cluster.
> Submitted a Sleep Job with 100K map tasks as follows:
> $HADOOP_COMMON_HOME/bin/hadoop jar $HADOOP_MAPRED_HOME/hadoop-test.jar sleep -m 100000 -r 0 -mt 1000 -rt 1000
> When around 53K tasks had run, logged in to the node running the AppMaster and killed the AppMaster with kill -9.
> The Resource Manager tried to restart the AM up to max-retries times but failed with the following:
> {code}
> 11/10/19 15:29:09 INFO mapreduce.Job: Job job_1319036155027_0002 failed with state FAILED due to: Application
> application_1319036155027_0002 failed 5 times due to AM Container for appattempt_1319036155027_0002_000005 exited with
> exitCode: -1000 due to: RemoteTrace:
> java.io.IOException: Resource
> hdfs://<NN>:<PORT>/user/<JOBUSER>/.staging/job_1319036155027_0002/appTokens changed on src
> filesystem (expected 1319037705427, was 1319037714496
>             at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.copy(FSDownload.java:80)
>             at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.access$000(FSDownload.java:49)
>             at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload$1.run(FSDownload.java:149)
>             at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload$1.run(FSDownload.java:147)
>             at java.security.AccessController.doPrivileged(Native Method)
>             at javax.security.auth.Subject.doAs(Subject.java:396)
>             at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
>             at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.call(FSDownload.java:145)
>             at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.call(FSDownload.java:49)
>             at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>             at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>             at java.lang.Thread.run(Thread.java:619)
>  at LocalTrace: 
>             org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Resource
> hdfs://<NN>:<PORT>/user/<JOBUSER>/.staging/job_1319036155027_0002/appTokens changed on src
> filesystem (expected 1319037705427, was 1319037714496
>             at
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>             at
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>             at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:798)
>             at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:483)
>             at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:228)
>             at
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
>             at
> org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
>             at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:343)
>             at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1486)
>             at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1482)
>             at java.security.AccessController.doPrivileged(Native Method)
>             at javax.security.auth.Subject.doAs(Subject.java:396)
>             at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
>             at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1480)
> .Failing this attempt.. Failing the application.
> 11/10/19 15:29:09 INFO mapreduce.Job: Counters: 0
> {code}
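
The trace above shows the localizer rejecting appTokens because the timestamp it expected (1319037705427) no longer matched the one it found on the source filesystem (1319037714496). A minimal sketch of that kind of check follows. It is not the Hadoop source: the class name TimestampCheckSketch and the method verifyUnchanged are hypothetical, and the real logic lives in FSDownload.copy as named in the trace. Only the message wording and the two timestamps are taken from the log.

```java
import java.io.IOException;

// Illustrative only: a stand-in for a "has the resource changed since it was
// registered?" check. Names here are hypothetical, not Hadoop APIs.
public class TimestampCheckSketch {

    // Throws if the timestamp seen on the source filesystem differs from the
    // timestamp recorded when the resource was registered for localization.
    static void verifyUnchanged(String path, long expected, long actual)
            throws IOException {
        if (actual != expected) {
            throw new IOException("Resource " + path
                    + " changed on src filesystem (expected " + expected
                    + ", was " + actual + ")");
        }
    }

    public static void main(String[] args) {
        // Path placeholders are kept exactly as redacted in the report.
        String path = "hdfs://<NN>:<PORT>/user/<JOBUSER>/.staging/"
                + "job_1319036155027_0002/appTokens";
        try {
            // Timestamps copied from the log in the issue description.
            verifyUnchanged(path, 1319037705427L, 1319037714496L);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the two values from the log the check fails on every attempt, which is consistent with all 5 retries ending the same way: a retry cannot succeed while the recorded timestamp and the file on the source filesystem disagree.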

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
