hadoop-common-dev mailing list archives

From "Joe Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-13064) LineReader reports incorrect number of bytes read resulting in correctness issues using LineRecordReader
Date Tue, 26 Apr 2016 21:13:12 GMT
Joe Ellis created HADOOP-13064:
----------------------------------

             Summary: LineReader reports incorrect number of bytes read resulting in correctness issues using LineRecordReader
                 Key: HADOOP-13064
                 URL: https://issues.apache.org/jira/browse/HADOOP-13064
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Joe Ellis
            Priority: Critical


The specific issue we saw with LineReader is that when we pass in '\r\n' as the line
delimiter, the number of bytes it claims to have read is less than the number it actually read.
We narrowed this down to the case where the delimiter is split across the internal buffer
boundary: if fillBuffer fills the buffer with "row\r" and the next call fills it with "\n",
the number of bytes reported is 4 rather than 5.
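
To make the expected accounting concrete, here is a minimal sketch of a delimiter-aware reader
that keeps its byte count correct when the delimiter straddles a buffer refill. It is not
Hadoop's LineReader: the class DelimitedReaderSketch, its constructor, and readRecord are
hypothetical names made up for illustration, and the matching logic assumes a delimiter such
as "\r\n" whose first byte does not repeat inside it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical illustration only; this is NOT Hadoop's LineReader. */
class DelimitedReaderSketch {
    private final InputStream in;
    private final byte[] buffer;
    private final byte[] delimiter;
    private int bufferLength = 0; // number of valid bytes currently in buffer
    private int bufferPos = 0;    // index of the next unread byte in buffer

    DelimitedReaderSketch(InputStream in, int bufferSize, byte[] delimiter) {
        this.in = in;
        this.buffer = new byte[bufferSize]; // assumed to be > 0
        this.delimiter = delimiter.clone();
    }

    /**
     * Reads one record into out (without the delimiter) and returns the number
     * of bytes consumed from the stream, delimiter included. Returns 0 at EOF.
     */
    int readRecord(ByteArrayOutputStream out) {
        out.reset();
        int consumed = 0;
        int matched = 0; // delimiter bytes matched so far; survives a refill
        while (true) {
            if (bufferPos >= bufferLength) {
                try {
                    bufferLength = in.read(buffer);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                bufferPos = 0;
                if (bufferLength <= 0) {
                    bufferLength = 0;
                    // EOF: a partially matched delimiter was ordinary data.
                    out.write(delimiter, 0, matched);
                    return consumed;
                }
            }
            byte b = buffer[bufferPos++];
            consumed++; // count every byte taken from the stream, exactly once
            if (b == delimiter[matched]) {
                matched++;
                if (matched == delimiter.length) {
                    return consumed; // full delimiter seen, possibly across buffers
                }
            } else {
                // Partial match failed: flush the held-back bytes as data.
                out.write(delimiter, 0, matched);
                matched = 0;
                if (b == delimiter[0]) {
                    matched = 1;
                } else {
                    out.write(b);
                }
            }
        }
    }
}
```

With a 4-byte buffer and the input "row\r\n", the first fill holds "row\r" and the second
starts with "\n"; because the counter is incremented per byte and the partial match is carried
across the refill, readRecord reports 5 bytes for that record.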

This causes correctness issues in LineRecordReader: if this off-by-one error occurs enough
times while reading a split, the reader continues to read records past its split boundary,
so records appear to come from multiple splits.
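
The effect on a split can be illustrated with a simplified model. The sketch below assumes a
reader that advances a tracked position by the reported byte count and keeps reading while
that position is at or before the split end; this is a simplification of what LineRecordReader
does, and SplitOverrunSketch and recordsRead are hypothetical names used only here.

```java
/** Hypothetical model of position tracking; NOT the real LineRecordReader. */
class SplitOverrunSketch {
    /**
     * Counts how many records a reader consumes before its tracked position
     * passes splitEnd, when each record is reported as reportedBytesPerRecord
     * bytes long (which may be less than the true size if bytes are undercounted).
     */
    static int recordsRead(int totalRecords, int reportedBytesPerRecord, long splitEnd) {
        long pos = 0;  // position as the reader believes it to be
        int count = 0;
        while (pos <= splitEnd && count < totalRecords) {
            pos += reportedBytesPerRecord; // advance by the reported size only
            count++;
        }
        return count;
    }
}
```

For ten 5-byte records ("row\r\n") and a split ending at byte 20, recordsRead(10, 5, 20) gives
5, while an undercount to 4 bytes per record, recordsRead(10, 4, 20), gives 6: the tracked
position lags the true one, so the reader takes one record too many.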



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
