Return-Path: Delivered-To: apmail-lucene-hadoop-dev-archive@locus.apache.org Received: (qmail 30456 invoked from network); 28 Mar 2007 16:31:50 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 28 Mar 2007 16:31:50 -0000 Received: (qmail 28216 invoked by uid 500); 28 Mar 2007 16:31:55 -0000 Delivered-To: apmail-lucene-hadoop-dev-archive@lucene.apache.org Received: (qmail 28106 invoked by uid 500); 28 Mar 2007 16:31:55 -0000 Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: hadoop-dev@lucene.apache.org Delivered-To: mailing list hadoop-dev@lucene.apache.org Received: (qmail 28016 invoked by uid 99); 28 Mar 2007 16:31:54 -0000 Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Mar 2007 09:31:54 -0700 X-ASF-Spam-Status: No, hits=-100.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.4] (HELO brutus.apache.org) (140.211.11.4) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 28 Mar 2007 09:31:45 -0700 Received: from brutus (localhost [127.0.0.1]) by brutus.apache.org (Postfix) with ESMTP id BB404714071 for ; Wed, 28 Mar 2007 09:31:25 -0700 (PDT) Message-ID: <16834717.1175099485763.JavaMail.jira@brutus> Date: Wed, 28 Mar 2007 09:31:25 -0700 (PDT) From: "Devaraj Das (JIRA)" To: hadoop-dev@lucene.apache.org Subject: [jira] Updated: (HADOOP-1159) Reducers hang when map output file has a checksum error In-Reply-To: <25245855.1174926932140.JavaMail.jira@brutus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/HADOOP-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj Das updated HADOOP-1159: -------------------------------- Attachment: 1159.patch Owen, this patch handles the server side exception as well (it doesn't have the client-side exception handling that you do in your patch). Basically, changed the exception handling in doGet method to also handle NPE. The body of the IOException and NPE handlers are common and so moved to a separate method. > Reducers hang when map output file has a checksum error > ------------------------------------------------------- > > Key: HADOOP-1159 > URL: https://issues.apache.org/jira/browse/HADOOP-1159 > Project: Hadoop > Issue Type: Bug > Components: mapred > Affects Versions: 0.12.2 > Reporter: Nigel Daley > Assigned To: Owen O'Malley > Fix For: 0.12.3 > > Attachments: 1159.patch, h1159-2.patch, h1159.patch > > > Two reduces hung in our sort benchmark. They always fail to get map outputs from node X due to checksum error when the map outputs are read at that node resulting in a NullPointerException on node X. This leads to constant failures on the two fetching reduces. > 2007-03-26 00:02:57,082 WARN org.apache.hadoop.fs.FileSystem: Moving bad file /e/c/k/hqa/tb/tmp/mapred/local2/task_0002_m_022488_0/file.out to /e/c/bad_files/file.out.542279301 > 2007-03-26 00:02:57,083 INFO org.apache.hadoop.fs.FSInputChecker: Found checksum error: org.apache.hadoop.fs.ChecksumException: Checksum error: /e/c/k/hqa/tb/tmp/mapred/local2/task_0002_m_022488_0/file.out at 106484224 > at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.verifySum(ChecksumFileSystem.java:254) > at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:211) > at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167) > at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) > at java.io.BufferedInputStream.read(BufferedInputStream.java:317) > at java.io.DataInputStream.read(DataInputStream.java:132) > at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:1659) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:689) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:802) > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427) > at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475) > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567) > at org.mortbay.http.HttpContext.handle(HttpContext.java:1565) > at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635) > at org.mortbay.http.HttpContext.handle(HttpContext.java:1517) > at org.mortbay.http.HttpServer.service(HttpServer.java:954) > at org.mortbay.http.HttpConnection.service(HttpConnection.java:814) > at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981) > at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831) > at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244) > at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357) > at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534) > 2007-03-26 00:02:57,083 WARN /: /mapOutput?map=task_0002_m_022488_0&reduce=1542: > java.lang.NullPointerException -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.