hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-244) Application Master Retries fail due to FileNotFoundException
Date Mon, 26 Nov 2012 15:19:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503832#comment-13503832
] 

Jason Lowe commented on YARN-244:
---------------------------------

could you provide a bit more detail from the AM logs when this occurs?  I'm not able to reproduce
this with a sleep job and manually killing the AM to simulate failure.  Normally the AM tries
to determine if it is the last attempt and only deletes the files if it is convinced there
will be more attempts.  If you could provide steps to reproduce or details from the AM logs
showing why it decided to remove the staging directory that would help clarify what's going
on in this case.
                
> Application Master Retries fail due to FileNotFoundException
> ------------------------------------------------------------
>
>                 Key: YARN-244
>                 URL: https://issues.apache.org/jira/browse/YARN-244
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications
>    Affects Versions: 2.0.2-alpha, 2.0.1-alpha
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>            Priority: Blocker
>
> Application attempt1 is deleting the job related files and these are not present in the
HDFS for following retries.
> {code:xml}
> Application application_1353724754961_0001 failed 4 times due to AM Container for appattempt_1353724754961_0001_000004
exited with exitCode: -1000 due to: RemoteTrace: java.io.FileNotFoundException: File does
not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:752)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:88) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:49)
at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:157) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:155)
at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at
org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:153) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:49)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
at LocalTrace: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does
not exist: hdfs://hacluster:8020/tmp/hadoop-yarn/staging/mapred/.staging/job_1353724754961_0001/appTokens
at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:822)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:492)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:221)
at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:427)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:924) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1692)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1688) at java.security.AccessController.doPrivileged(Native
Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1686) .Failing this attempt.. Failing
the application. 
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message