Message-ID: <22296423.1192817870752.JavaMail.jira@brutus>
Date: Fri, 19 Oct 2007 11:17:50 -0700 (PDT)
From: "Richard Lee (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Created: (HADOOP-2080) ChecksumFileSystem checksum file size incorrect.

ChecksumFileSystem checksum file size incorrect.
------------------------------------------------

                 Key: HADOOP-2080
                 URL: https://issues.apache.org/jira/browse/HADOOP-2080
             Project: Hadoop
          Issue Type: Bug
          Components: fs
    Affects Versions: 0.14.2, 0.14.1, 0.14.0
         Environment: Sun jdk1.6.0_02 running on Linux CentOS 5
            Reporter: Richard Lee


Periodically, reduce tasks hang. The log for the affected task then shows a stack trace like this:

2007-10-18 17:02:04,227 WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Insufficient space
        at org.apache.hadoop.fs.InMemoryFileSystem$RawInMemoryFileSystem$InMemoryOutputStream.write(InMemoryFileSystem.java:174)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:39)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.writeChunk(ChecksumFileSystem.java:326)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:140)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:122)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:310)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
        at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:253)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:685)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:637)

The problem stems from a miscalculation of the size of the checksum file created in the InMemoryFileSystem for the data being copied from a completed map task to the reduce task.
The method used for calculating the checksum file size is the following (ChecksumFileSystem:318):

    ((long)(Math.ceil((float)size/bytesPerSum)) + 1) * 4 + CHECKSUM_VERSION.length;

The issue here is the cast to float. A Java float has only 24 bits of precision, so a size over 0x1000000 may be rounded by the cast, and the calculation can then return a value that is too small.

The fix is to replace this calculation with something that doesn't cast to float:

    (((size+1)/bytesPerSum) + 2) * 4 + CHECKSUM_VERSION.length
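
To make the precision loss concrete, here is a minimal sketch that runs both expressions from the report side by side. The class name ChecksumSizeDemo, the helper names, and the versionLength parameter (standing in for CHECKSUM_VERSION.length) are made up for illustration and are not part of Hadoop. The values 512 for bytesPerSum and 4 for the version header length are assumptions chosen for the demo.

```java
class ChecksumSizeDemo {

    // Chunk count as computed by the expression quoted in the report:
    // the cast to float rounds sizes that need more than 24 bits.
    static long chunksViaFloat(long size, int bytesPerSum) {
        return (long) Math.ceil((float) size / bytesPerSum);
    }

    // Exact ceiling division using integer arithmetic only.
    static long chunksExact(long size, int bytesPerSum) {
        return (size + bytesPerSum - 1) / bytesPerSum;
    }

    // Full size expression quoted from ChecksumFileSystem:318 in the report.
    // versionLength is a hypothetical stand-in for CHECKSUM_VERSION.length.
    static long floatBased(long size, int bytesPerSum, int versionLength) {
        return ((long) (Math.ceil((float) size / bytesPerSum)) + 1) * 4 + versionLength;
    }

    // Replacement expression proposed in the report (no float cast).
    static long integerBased(long size, int bytesPerSum, int versionLength) {
        return (((size + 1) / bytesPerSum) + 2) * 4 + versionLength;
    }

    public static void main(String[] args) {
        int bytesPerSum = 512;    // assumed value for the demo
        int versionLength = 4;    // assumed value for the demo
        long size = 0x1000001L;   // 2^24 + 1, not representable as a float

        // The cast silently turns 16777217 into 16777216.
        System.out.println((long) (float) size);                            // 16777216
        System.out.println(chunksViaFloat(size, bytesPerSum));              // 32768
        System.out.println(chunksExact(size, bytesPerSum));                 // 32769
        System.out.println(floatBased(size, bytesPerSum, versionLength));   // 131080
        System.out.println(integerBased(size, bytesPerSum, versionLength)); // 131084
    }
}
```

With these inputs the float cast drops the final partial chunk, so the float-based chunk count is one lower than the exact count. Below 0x1000000 the cast is exact and the two chunk counts agree.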