Date: Thu, 12 Jul 2012 05:55:34 +0000 (UTC)
From: "Rahul Jain (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Message-ID: <1283612463.40377.1342072534971.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1005904836.37008.1342031254795.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Updated] (MAPREDUCE-4428) A failed job is not available under job history if the job is killed right around the time job is notified as failed

     [ https://issues.apache.org/jira/browse/MAPREDUCE-4428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rahul Jain updated MAPREDUCE-4428:
----------------------------------

    Attachment: resrcmgr_bad.txt

Attached are the resource manager logs for the failure case.
Note that the resource manager was not restarted at any time, and the same stack trace can be found on the resource manager when the application attempts to unregister:

{code}
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Application doesn't exist in cache appattempt_1341894680756_0017_000001....
{code}

> A failed job is not available under job history if the job is killed right around the time job is notified as failed
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4428
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4428
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, jobtracker
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>         Attachments: appMaster_bad.txt, appMaster_good.txt, resrcmgr_bad.txt
>
>
> We have observed this issue consistently when running the Hadoop CDH4 version (based on the 2.0 alpha release):
> When our Hadoop client code gets a notification for a completed job (using a RunningJob object job, with job.isComplete() && job.isSuccessful()==false), it does an unconditional job.killJob() to terminate the job.
> With earlier Hadoop versions (verified on Hadoop 0.20.2), we still have full access to the job logs afterwards through the Hadoop console. However, when using MapReduce V2, the failed Hadoop job no longer shows up under the jobhistory server. Also, the tracking URL of the job still points to the non-existent Application Master HTTP port.
> Once we removed the call to job.killJob() for failed jobs from our Hadoop client code, we were able to access the job in job history with MapReduce V2 as well. Therefore this appears to be a race condition in job management with respect to job history for failed jobs.
> We do have the application master and node manager logs collected for this scenario if that will help isolate the problem and the fix.
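
For readers who hit the same symptom, the client-side change described in the report (no longer calling job.killJob() on a job that has already completed as failed) can be sketched roughly as below. This is only an illustration of the workaround, not a fix for the underlying race. JobHandle, FakeJob, and killIfStillRunning are hypothetical names introduced for this sketch; JobHandle merely mirrors the three RunningJob methods named in the report, and the real Hadoop methods also declare IOException, which is omitted here to keep the example dependency-free.

```java
/**
 * Sketch of a client-side guard: issue a kill only while the job is running.
 * All names here are hypothetical stand-ins, not part of the Hadoop API.
 */
final class KillGuardSketch {

    /** Hypothetical mirror of the RunningJob methods mentioned in the report. */
    interface JobHandle {
        boolean isComplete();
        boolean isSuccessful();
        void killJob();
    }

    /**
     * Kills the job only if it has not completed yet.
     * Returns true if a kill was issued, false if the job was left alone.
     */
    static boolean killIfStillRunning(JobHandle job) {
        if (job.isComplete()) {
            // Already finished (successfully or not): skip the kill so the
            // job's own shutdown and history handling are not interrupted.
            return false;
        }
        job.killJob();
        return true;
    }

    /** In-memory fake used only to exercise the guard. */
    static final class FakeJob implements JobHandle {
        boolean complete;
        final boolean successful;
        int killCalls = 0;

        FakeJob(boolean complete, boolean successful) {
            this.complete = complete;
            this.successful = successful;
        }

        @Override public boolean isComplete() { return complete; }
        @Override public boolean isSuccessful() { return successful; }

        @Override public void killJob() {
            killCalls++;
            complete = true; // a killed job is treated as finished
        }
    }

    public static void main(String[] args) {
        FakeJob failed = new FakeJob(true, false);
        System.out.println("kill issued for failed job: " + killIfStillRunning(failed));

        FakeJob running = new FakeJob(false, false);
        System.out.println("kill issued for running job: " + killIfStillRunning(running));
    }
}
```

In real client code the same check would sit wherever the completion notification is handled, with the guard applied to the actual RunningJob instance.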