Date: Fri, 21 Oct 2011 02:36:33 +0000 (UTC)
From: "Mahadev konar (Commented) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <253317322.56.1319164593282.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1272939297.15951.1319127791199.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (MAPREDUCE-3233) AM fails to restart when first AM is killed

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132285#comment-13132285 ]

Mahadev konar commented on MAPREDUCE-3233:
------------------------------------------

Ok, found the issue. Here's the problem in JobImpl.java:

{code}
Path remoteJobTokenFile =
    new Path(job.remoteJobSubmitDir,
        MRJobConfig.APPLICATION_TOKENS_FILE);
tokenStorage.writeTokenStorageFile(remoteJobTokenFile, job.conf);
LOG.info("Writing back the job-token file on the remote file system:"
    + remoteJobTokenFile.toString());
{code}

We overwrite the app tokens file in the MRAppMaster. This file is one of the files listed as resources for starting the MRAppMaster. The timestamp recorded for the resource when the client added it goes stale once the MRAppMaster rewrites the file.

We can probably move this to the client side, so that the client creates the jobtoken file that can be used for authenticating tasks to the AM.

Thoughts? Issues?

> AM fails to restart when first AM is killed
> -------------------------------------------
>
>                 Key: MAPREDUCE-3233
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3233
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Karam Singh
>            Priority: Blocker
>
> Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn cluster.
> Submitted a Sleep Job of 100K map tasks as follows:
> $HADOOP_COMMON_HOME/bin/hadoop jar $HADOOP_MAPRED_HOME/hadoop-test.jar sleep -m 100000 -r 0 -mt 1000 -rt 1000
> After around 53K tasks had run, logged in to the node running the AppMaster and killed the AppMaster with kill -9.
> The ResourceManager tried to restart the AM up to max-retries times but failed with the following:
> {code}
> 11/10/19 15:29:09 INFO mapreduce.Job: Job job_1319036155027_0002 failed with state FAILED due to: Application
> application_1319036155027_0002 failed 5 times due to AM Container for appattempt_1319036155027_0002_000005 exited with
> exitCode: -1000 due to: RemoteTrace:
> java.io.IOException: Resource
> hdfs://:/user//.staging/job_1319036155027_0002/appTokens changed on src
> filesystem (expected 1319037705427, was 1319037714496
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.copy(FSDownload.java:80)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.access$000(FSDownload.java:49)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload$1.run(FSDownload.java:149)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload$1.run(FSDownload.java:147)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.call(FSDownload.java:145)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.FSDownload.call(FSDownload.java:49)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:619)
>  at LocalTrace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Resource
> hdfs://:/user//.staging/job_1319036155027_0002/appTokens changed on src
> filesystem (expected 1319037705427, was 1319037714496
>         at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
>         at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:798)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:483)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:228)
>         at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
>         at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
>         at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:343)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1486)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1482)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1152)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1480)
> .Failing this attempt.. Failing the application.
> 11/10/19 15:29:09 INFO mapreduce.Job: Counters: 0
> {code}
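
To make the failure mode in the comment above concrete, here is a minimal sketch of the kind of timestamp check that appears to be tripping during localization. It is not Hadoop's FSDownload code: the class name StaleResourceSketch and the method verifyUnchanged are hypothetical, a local temp file stands in for the appTokens file on HDFS, and java.nio is used in place of the Hadoop FileSystem API. The two timestamps are the ones from the trace, rounded down to whole seconds so they survive coarse file-system granularity.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical illustration only; not part of Hadoop.
final class StaleResourceSketch {

    // Stand-in for a localizer-style check: fail if the file's modification
    // time differs from the timestamp recorded when the resource was registered.
    static void verifyUnchanged(Path resource, long expectedMillis) throws IOException {
        long actualMillis = Files.getLastModifiedTime(resource).toMillis();
        if (actualMillis != expectedMillis) {
            throw new IOException("Resource " + resource
                + " changed on src filesystem (expected " + expectedMillis
                + ", was " + actualMillis + ")");
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a temp file plays the role of the appTokens file.
        Path tokens = Files.createTempFile("appTokens", null);
        try {
            // Client side: the file is written and its timestamp is recorded
            // as part of the resource description for the AM container.
            Files.setLastModifiedTime(tokens, FileTime.fromMillis(1319037705000L));
            long recorded = Files.getLastModifiedTime(tokens).toMillis();

            // First AM attempt: the file is untouched, so the check passes.
            verifyUnchanged(tokens, recorded);

            // The first AM then writes the token file back, which moves
            // the modification time forward.
            Files.write(tokens, new byte[] {1});
            Files.setLastModifiedTime(tokens, FileTime.fromMillis(1319037714000L));

            // Restarted AM attempt: the recorded timestamp is now stale.
            try {
                verifyUnchanged(tokens, recorded);
                System.out.println("unexpected: check passed");
            } catch (IOException e) {
                System.out.println(e.getMessage());
            }
        } finally {
            Files.delete(tokens);
        }
    }
}
```

Under these assumptions the second check fails on every retry, because the recorded timestamp never changes while the file on the source file system already has. That matches all five attempts failing in the report, and it is why writing the file once on the client side, before the resource is registered, would avoid the mismatch.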