Date: Thu, 14 Jun 2007 18:53:26 -0700 (PDT)
From: "Bwolen Yang (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Commented: (HADOOP-1489) Input file get truncated for text files with \r\n

    [ https://issues.apache.org/jira/browse/HADOOP-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504999 ]

Bwolen Yang commented on HADOOP-1489:
-------------------------------------

> Shall we enforce that LineRecordReader.readLine to take BufferedInputStream as an input?

Would it be too restrictive to force BufferedInputStream on future extensions of LineRecordReader? The only real requirement is that the InputStream supports mark()/reset(). This would be easy to check if a LineReader class were broken out of LineRecordReader (though that may be overkill until someone else needs a LineReader outside of LineRecordReader).

Another thing: reading BufferedReader.readLine(), its implementation does not depend on mark()/reset(). Instead it remembers having seen a \r, so that when the caller asks for the next line and the first character is \n, it skips that character. It is too bad BufferedReader.readLine() returns a String instead of letting the caller pass in an OutputStream to write to; that would have simplified the code here and avoided a copy.
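For illustration, here is a minimal sketch of that \r-remembering technique, working on a plain InputStream with no mark()/reset() support. The class and method names are hypothetical, not Hadoop's actual LineRecordReader:

{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical sketch: reads one line terminated by \n, \r, or \r\n and
// writes its bytes to 'out'. A pending-\r flag replaces mark()/reset():
// if the previous line ended in \r, a leading \n on the next call is the
// second half of a \r\n pair and gets skipped.
class SimpleLineReader {
    private final InputStream in;
    private boolean skipLF = false;   // last line ended with \r

    SimpleLineReader(InputStream in) { this.in = in; }

    /** Returns the number of bytes written to 'out', or -1 at end of stream. */
    int readLine(OutputStream out) throws IOException {
        int c = in.read();
        if (skipLF && c == '\n') {    // \n completing a prior \r
            c = in.read();
        }
        skipLF = false;
        if (c == -1) {                // end of stream, no more lines
            return -1;
        }
        int n = 0;
        while (c != -1 && c != '\n' && c != '\r') {
            out.write(c);
            n++;
            c = in.read();
        }
        if (c == '\r') {
            skipLF = true;            // a \n may follow on the next call
        }
        return n;
    }
}
{code}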
> Input file get truncated for text files with \r\n
> -------------------------------------------------
>
>                 Key: HADOOP-1489
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1489
>             Project: Hadoop
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.13.0
>            Reporter: Bwolen Yang
>         Attachments: MRIdentity.java, slashr33.txt
>
>
> When the input file has \r\n, LineRecordReader uses mark()/reset() to read one byte ahead to check whether the \r is followed by \n. This probably causes BufferedInputStream to issue a small read request (e.g., 127 bytes).
> The ChecksumFileSystem.FSInputChecker.read() code
> {code}
>     public int read(byte b[], int off, int len) throws IOException {
>       // make sure that it ends at a checksum boundary
>       long curPos = getPos();
>       long endPos = len+curPos/bytesPerSum*bytesPerSum;
>       return readBuffer(b, off, (int)(endPos-curPos));
>     }
> {code}
> tries to truncate "len" to a checksum boundary. For DFS, bytesPerSum is 512. Since Java parses the expression as len + (curPos/bytesPerSum)*bytesPerSum, the length handed to readBuffer works out to len - (curPos % bytesPerSum), which is negative whenever len is smaller than the current offset past the last checksum boundary (i.e., endPos - curPos < 0). The underlying DFS read returns 0 when the length is negative, but readBuffer turns that 0 into -1, assuming end-of-file has been reached. Effectively, the rest of the input file never gets read: in my case, only 8MB of a 52MB file was actually read. Two sample stacks are appended below.
> One related issue: if there are assumptions (such as len >= bytesPerSum) in FSInputChecker's read(), would it be ok to add a check that throws an exception when the assumption is violated? The assumption is a bit unusual, and as code changes (both Hadoop's and Java's implementation of BufferedInputStream), it may get violated. Silently dropping a large part of the input is really difficult for people to notice (and debug) once they start dealing with terabytes of data. Also, I suspect the performance impact of such a check would not be noticeable.
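To make the arithmetic concrete, and to show one possible shape of the suggested check, here is a hypothetical sketch (not the committed fix); the numbers in the comments come from the two stacks below:

{code}
// Hypothetical sketch of the suggested guard, not the committed fix.
// With the expression as currently written:
//   endPos - curPos = len - (curPos % bytesPerSum)
// First stack below:  len=127, pos=45223932, 45223932 % 512 = 508 -> 127 - 508 = -381
// Second stack below: len=400, pos=4503,     4503 % 512     = 407 -> 400 - 407 = -7
public int read(byte b[], int off, int len) throws IOException {
    // make sure that it ends at a checksum boundary
    long curPos = getPos();
    long endPos = len + curPos / bytesPerSum * bytesPerSum;
    if (endPos < curPos) {
        // fail loudly instead of letting readBuffer report a bogus end-of-file
        throw new IOException("FSInputChecker.read: negative chunk length "
            + (endPos - curPos) + " (len=" + len + ", pos=" + curPos
            + ", bytesPerSum=" + bytesPerSum + ")");
    }
    return readBuffer(b, off, (int) (endPos - curPos));
}
{code}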
> bwolen
> Here are two sample stacks. (I have readBuffer throw when it gets 0 bytes, and have the InputChecker catch that exception and rethrow it. This way I capture the values from both caller and callee; the callee's is the one that starts with "Caused by".)
> -------------------------------------
> {code}
> java.lang.RuntimeException: end of read() in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=127 pos=45223932 res=-999999
>         at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>         at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
>         at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
>         at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
> Caused by: java.lang.RuntimeException: end of read() datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=45223932 len=-381 bytesPerSum=512 eof=false read=0
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
>         at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
>         ... 11 more
> ---------------
> java.lang.RuntimeException: end of read() in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=400 pos=4503 res=-999999
>         at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
>         at java.io.FilterInputStream.read(FilterInputStream.java:66)
>         at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
>         at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
>         at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
>         at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
> Caused by: java.lang.RuntimeException: end of read() datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=4503 len=-7 bytesPerSum=512 eof=false read=0
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
>         at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
>         at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
>         ... 11 more
> {code}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.