Message-ID: <654361532.1222835984881.JavaMail.jira@brutus>
Date: Tue, 30 Sep 2008 21:39:44 -0700 (PDT)
From: "Devaraj Das (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4296) Spasm of JobClient failures on successful jobs every once in a while
In-Reply-To: <1228560926.1222535626082.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635942#action_12635942 ]

Devaraj Das commented on HADOOP-4296:
-------------------------------------

Hi Dhruba, that's a fair point. The RPC handler thread would be blocked during the scan. I am worried about the submitJob API as well. That does a bunch of dfs operations too. We probably should do something about that as well (later on). However, since we already do dfs scans per job inline in the RPC handler (and AFAIK there is no noticeable impact), I'd like to see the impact of these additional dfs lookups on your system... Would that be a lot of work at your end?

> Spasm of JobClient failures on successful jobs every once in a while
> --------------------------------------------------------------------
>
>                 Key: HADOOP-4296
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4296
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.17.1
>            Reporter: Joydeep Sen Sarma
>            Assignee: dhruba borthakur
>            Priority: Critical
>         Attachments: 4296_jt_delayretire.patch
>
>
> At very busy times - we get a wave of job client failures all at the same time. the failures come when the job is about to complete. when we look at the job history files - the jobs are actually complete. Here's the stack:
> 08/09/27 02:18:00 INFO mapred.JobClient: map 100% reduce 98%
> 08/09/27 02:18:41 INFO mapred.JobClient: map 100% reduce 99%
> java.lang.NullPointerException
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:993)
>         at com.facebook.hive.common.columnSetLoader.main(columnSetLoader.java:535)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.