hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-993) job can not recovery after restart resourcemanager
Date Tue, 30 Jul 2013 04:25:48 GMT

    [ https://issues.apache.org/jira/browse/YARN-993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13723385#comment-13723385 ]

Jian He commented on YARN-993:
------------------------------

[~prophy999] If you are running 2.0.5-alpha and want to test RM restart, you need to
manually Ctrl-C the command line as soon as you see the message saying the job has been
submitted, because MR will clean up the staging dir if the RM is not available.
This problem has been fixed in YARN-513.
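
The manual Ctrl-C step described above could be scripted roughly as follows. This is only a sketch: the helper name `interrupt_after_marker` is hypothetical, and both the command and the marker text are parameters you would have to supply yourself, since the exact wording of the "job submitted" message is not quoted in this thread. It assumes a POSIX system where SIGINT is what Ctrl-C delivers.

```python
import signal
import subprocess
import sys


def interrupt_after_marker(cmd, marker):
    """Run cmd and send SIGINT once a line of its output contains marker.

    Returns True if the marker was seen, False if the process ended first.
    The marker string is an assumption supplied by the caller.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    seen = False
    with proc.stdout:
        for line in proc.stdout:
            sys.stdout.write(line)  # echo the client output as it arrives
            if marker in line:
                seen = True
                if proc.poll() is None:
                    # Same signal a terminal Ctrl-C sends to the process.
                    proc.send_signal(signal.SIGINT)
                break
    proc.wait()
    return seen


if __name__ == "__main__":
    # Stand-in child process, NOT a real Hadoop client: it prints a fake
    # submission message and then sleeps, so the helper can be tried safely.
    fake_client = [
        sys.executable, "-u", "-c",
        "import time; print('job submitted', flush=True); time.sleep(30)",
    ]
    print(interrupt_after_marker(fake_client, "submitted"))
```

To use it against a real cluster you would pass your own job-submission command and whatever submission message your client actually prints; neither is specified here.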
                
> job can not recovery after restart resourcemanager
> --------------------------------------------------
>
>                 Key: YARN-993
>                 URL: https://issues.apache.org/jira/browse/YARN-993
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.5-alpha
>         Environment: CentOS5.3 JDK1.7.0_11
>            Reporter: prophy Yan
>            Priority: Critical
>
> Recently I tested the job recovery function in the YARN framework, but it failed.
> First, I ran the wordcount example program, and then I killed the resourcemanager process
> on the server with kill -9 when the wordcount job was at map 100%.
> The job exited with an error within minutes.
> Second, I restarted the resourcemanager on the server using the 'start-yarn.sh' command,
> but the failed job (wordcount) could not continue.
> The YARN log says "file not exist!"
> Here is the YARN log:
> 013-07-23 16:05:21,472 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
Done launching container Container: [ContainerId: container_1374564764970_0001_02_000001,
NodeId: mv8.mzhen.cn:52117, NodeHttpAddress: mv8.mzhen.cn:8042, Resource: <memory:2048,
vCores:1>, Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id
{, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 2, }, id: 1, },
state: C_NEW, ] for AM appattempt_1374564764970_0001_000002
> 2013-07-23 16:05:21,473 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1374564764970_0001_000002 State change from ALLOCATED to LAUNCHED
> 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1374564764970_0001_000002 State change from LAUNCHED to FAILED
> 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Application application_1374564764970_0001 failed 1 times due to AM Container for appattempt_1374564764970_0001_000002
exited with  exitCode: -1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
>         at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
>         at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
>         at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
>         at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
>         at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
>  at LocalTrace:
>         org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does
not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>         at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:491)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:218)
>         at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
>         at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
> .Failing this attempt.. Failing the application.
> 2013-07-23 16:05:21,935 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1374564764970_0001 State change from ACCEPTED to FAILED
> 2013-07-23 16:05:21,937 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=supertool        OPERATION=Application Finished - Failed TARGET=RMAppManager     RESULT=FAILURE
 DESCRIPTION=App failed with state: FAILED       PERMISSIONS=Application application_1374564764970_0001
failed 1 times due to AM Container for appattempt_1374564764970_0001_000002 exited with  exitCode:
-1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> This is the log in the YARN log file after I restarted the resourcemanager.
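
The failure signature in the quoted log (a `FileNotFoundException` for a file under a job's `.staging` directory) can be picked out of a ResourceManager log mechanically. The following is a minimal sketch, assuming the log lines look like the ones quoted in this message; the function name `find_missing_staging_files` and the regular expressions are illustrative and not part of Hadoop.

```python
import re

# Matches the exception line as it appears in the quoted log.
MISSING_FILE = re.compile(
    r"java\.io\.FileNotFoundException: File does not exist: (?P<path>\S+)"
)
# Extracts the job id from a path that goes through a .staging directory.
JOB_ID = re.compile(r"/\.staging/(?P<job>job_\d+_\d+)/")


def find_missing_staging_files(log_lines):
    """Return sorted, de-duplicated (job_id, path) pairs for missing staging files."""
    hits = set()
    for line in log_lines:
        match = MISSING_FILE.search(line)
        if not match:
            continue
        path = match.group("path")
        job = JOB_ID.search(path)
        if job:  # ignore missing files that are not under a staging dir
            hits.add((job.group("job"), path))
    return sorted(hits)


if __name__ == "__main__":
    sample = [
        "> java.io.FileNotFoundException: File does not exist: "
        "hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/"
        "job_1374564764970_0001/appTokens",
    ]
    for job_id, path in find_missing_staging_files(sample):
        print(job_id, path)
```

A hit for a job whose client was still running when the RM went down would be consistent with the staging-dir cleanup described in the comment, though the log alone does not prove that cause.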

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
