hadoop-common-dev mailing list archives

From "Bwolen Yang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1489) Input file get truncated for text files with \r\n
Date Sun, 17 Jun 2007 23:34:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505634 ]

Bwolen Yang commented on HADOOP-1489:
-------------------------------------


> changes under hadoop/fs/, It will heavily conflict with HADOOP-1470. 

Are you referring to the 3 changes in
   hadoop/fs/ChecksumFileSystem.java
(two in callers of readBuffer(), and one in readBuffer() itself)?

These 3 changes check for 2 assumptions:
1. read() assumes "len >= bytesPerSum"
2. readBuffer() assumes that if 0 bytes are read, it must be end-of-file.

- The changes for 1 are 11 lines in 1 code block.
- The changes for 2 are 11 lines in 2 code blocks.
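For illustration, a minimal standalone sketch of what the two checks amount to (the class
and field names here -- CheckedChunkReader, datas, pos -- are stand-ins, not the actual
patch to ChecksumFileSystem.java):

{code}
import java.io.IOException;
import java.io.InputStream;

class CheckedChunkReader {
  private final InputStream datas;   // stand-in for the underlying data stream
  private final int bytesPerSum;     // checksum chunk size, e.g. 512 for DFS
  private long pos = 0;

  CheckedChunkReader(InputStream datas, int bytesPerSum) {
    this.datas = datas;
    this.bytesPerSum = bytesPerSum;
  }

  // Assumption 1: read() is only ever asked for at least one full checksum chunk.
  public int read(byte[] b, int off, int len) throws IOException {
    if (len < bytesPerSum) {
      throw new IOException("read() assumes len >= bytesPerSum; got len=" + len
          + ", bytesPerSum=" + bytesPerSum);
    }
    return readBuffer(b, off, len);
  }

  // Assumption 2: a 0-byte result from the underlying stream means end-of-file.
  private int readBuffer(byte[] b, int off, int len) throws IOException {
    int read = datas.read(b, off, len);
    if (read == 0) {
      throw new IOException("0 bytes read at pos=" + pos + " with len=" + len
          + "; only end-of-file should return 0");
    }
    if (read > 0) {
      pos += read;
    }
    return read;
  }
}
{code}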

Generally, it would be nice for unusual assumptions to be verified at run time if doing so
doesn't have much performance impact.
Though if people think it would generate too much integration pain, I will remove the checks.
Please let me know which of these 3 changes would impact integration badly.

A separate question: would either of these assumptions go away with HADOOP-1470?
If not, could the implementer please add the checks in his/her version?
Alternatively, will HADOOP-1470 be ready soon?  If so, I would be happy to add those checks
back in after 1470 gets submitted.

Let me know.  Thanks.


> Input file get truncated for text files with \r\n
> -------------------------------------------------
>
>                 Key: HADOOP-1489
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1489
>             Project: Hadoop
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.13.0
>            Reporter: Bwolen Yang
>         Attachments: HADOOP-1489.patch, MRIdentity.java, slashr33.txt
>
>
> When an input file has \r\n, LineRecordReader uses mark()/reset() to read one byte ahead
> to check if \r is followed by \n.   This probably caused the BufferedInputStream to issue
> a small read request (e.g., 127 bytes).   The ChecksumFileSystem.FSInputChecker.read() code

> {code}
>    public int read(byte b[], int off, int len) throws IOException {
>      // make sure that it ends at a checksum boundary
>      long curPos = getPos();
>      long endPos = len+curPos/bytesPerSum*bytesPerSum;
>      return readBuffer(b, off, (int)(endPos-curPos));
>    }
> {code}
> tries to truncate "len" to a checksum boundary.  For DFS, bytesPerSum is 512.  So for small
> reads, the truncated length becomes negative (i.e., endPos - curPos is < 0).   The underlying
> DFS read returns 0 when the length is negative.  However, readBuffer changes it to -1, assuming
> end-of-file has been reached.   This effectively means the rest of the input file does not
> get read.  In my case, only 8MB of a 52MB file is actually read.   Two sample stacks are appended.
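> A minimal worked example of that arithmetic (a throwaway demo class; the values are taken
> from the first sample stack below):
> {code}
> public class TruncationDemo {
>   public static void main(String[] args) {
>     long curPos = 45223932L;    // position from the first sample stack
>     int len = 127;              // small read issued by BufferedInputStream
>     int bytesPerSum = 512;      // DFS checksum chunk size
>     // Java precedence groups this as len + (curPos / bytesPerSum) * bytesPerSum.
>     long endPos = len + curPos / bytesPerSum * bytesPerSum;
>     // curPos / 512 = 88327 (integer division); 88327 * 512 = 45223424
>     // endPos = 127 + 45223424 = 45223551
>     System.out.println(endPos - curPos);  // prints -381: the negative length passed down
>   }
> }
> {code}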
> One related issue: if there are assumptions (such as len >= bytesPerSum) in FSInputChecker's
> read(), would it be ok to add a check that throws an exception when the assumption is violated?
> This assumption is a bit unusual, and as code changes (in both Hadoop and Java's implementation
> of BufferedInputStream), the assumption may get violated.  Silently dropping a large part
> of the input seems really difficult for people to notice (and debug) once they start to deal
> with terabytes of data.   Also, I suspect the performance impact of such a check would not
> be noticed.
> bwolen
> Here are two sample stacks.  (I have readBuffer() throw when it gets 0 bytes, and have
> the input checker catch the exception and rethrow it.  This way, I capture the values from
> both the caller and the callee; the callee's exception starts with "Caused by".)
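> A rough sketch of the caller-side rethrow (not the committed code; "in" and getPos() stand
> in for the wrapper's underlying stream and position):
> {code}
> // Catch the checker's exception and rethrow with the caller's values, so both layers
> // show up in one stack trace (the callee's values appear under "Caused by").
> public int read(byte[] b, int off, int len) throws IOException {
>   int res = -999999;  // sentinel: appears in the message if the inner read never returned
>   try {
>     res = in.read(b, off, len);
>     return res;
>   } catch (RuntimeException e) {
>     throw new RuntimeException("end of read()  in=" + in + " len=" + len
>         + " pos=" + getPos() + " res=" + res, e);
>   }
> }
> {code}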
> -------------------------------------
> {code}
> java.lang.RuntimeException: end of read()
> in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=127
> pos=45223932 res=-999999
>        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
>        at java.io.FilterInputStream.read(FilterInputStream.java:66)
>        at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
>        at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
>        at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
>        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
>        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
> Caused by: java.lang.RuntimeException: end of read()
> datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=45223932
> len=-381 bytesPerSum=512 eof=false read=0
>        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
>        at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
>        at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
>        ... 11 more
> ---------------
> java.lang.RuntimeException: end of read()  in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker
> len=400 pos=4503 res=-999999
> 	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
> 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> 	at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
> 	at java.io.FilterInputStream.read(FilterInputStream.java:66)
> 	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
> 	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
> 	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
> 	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
> 	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
> 	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
> Caused by: java.lang.RuntimeException: end of read()  datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream
> pos=4503 len=-7 bytesPerSum=512 eof=false read=0
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
> 	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
> 	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
> 	... 11 more
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

