From: "Joydeep Sen Sarma (JIRA)"
To: core-dev@hadoop.apache.org
Date: Sun, 5 Oct 2008 23:47:44 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while
Message-ID: <18279222.1223275664394.JavaMail.jira@brutus>
In-Reply-To: <1228560926.1222535626082.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637015#action_12637015 ]

Joydeep Sen Sarma
commented on HADOOP-4296:
-------------------------------------------

we definitely care about the status of completed jobs (and i think most installations would - given that at least some of the uses are always programmatic invocations that check return status).

does the jobstatus store need to scan dfs even when the job status is available in memory? (falling back to persistent store only when the data is missing in memory would seem like a good strategy).

another question is whether job counters are available from the persisted job status?

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient:  map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient:  map 100% reduce 99%
> java.lang.NullPointerException
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
>         at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
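
To make the memory-first suggestion in the comment concrete, here is a minimal sketch of a lookup that consults the persistent store only when the status is missing in memory. Everything in it is an assumption for illustration: the class MemoryFirstStatusLookup, its methods, the use of plain strings for job ids and statuses, and the Function standing in for the persistent store are hypothetical and are not the JobTracker's or the job status store's actual code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical memory-first status lookup; names are illustrative only. */
final class MemoryFirstStatusLookup {
    // Statuses of jobs still held in memory (hypothetical stand-in).
    private final Map<String, String> inMemory = new ConcurrentHashMap<>();
    // Stand-in for the persisted job status store (e.g. a DFS-backed reader).
    private final Function<String, Optional<String>> persistentStore;
    // Counts how often the slower persistent path was taken.
    private final AtomicInteger persistentReads = new AtomicInteger();

    MemoryFirstStatusLookup(Function<String, Optional<String>> persistentStore) {
        this.persistentStore = persistentStore;
    }

    /** Records a status in memory. */
    void remember(String jobId, String status) {
        inMemory.put(jobId, status);
    }

    /** Drops a job from memory, as would happen when it is retired. */
    void retire(String jobId) {
        inMemory.remove(jobId);
    }

    /** Returns the in-memory status if present; otherwise asks the persistent store. */
    Optional<String> lookup(String jobId) {
        String cached = inMemory.get(jobId);
        if (cached != null) {
            return Optional.of(cached);
        }
        persistentReads.incrementAndGet();
        return persistentStore.apply(jobId);
    }

    int persistentReads() {
        return persistentReads.get();
    }
}
```

Returning an Optional rather than null also forces a caller to handle the "status not found" case explicitly, which is the kind of situation the NullPointerException in the quoted stack trace suggests was not handled on the client side.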