hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-9622) bzip2 codec can drop records when reading data in splits
Date Fri, 07 Jun 2013 22:48:20 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated HADOOP-9622:
-------------------------------

    Attachment: HADOOP-9622.patch
                blockEndingInCRThenLF.txt.bz2
                blockEndingInCR.txt.bz2

Attaching a draft of a patch that I believe will fix the issue.  Comments welcome.

I no longer believe this is a codec issue, since the codec doesn't know anything about record
delimiters.  The codec is properly reporting when the next split has started to be read. 
The problem actually lies between the LineRecordReader and LineReader when the codec is involved,
as the LineRecordReader is relying solely on the codec to report when the split has completed,
oblivious to the buffering and peeking going on in LineReader.  If others agree, I can move
this to a MAPREDUCE JIRA.

The patch makes the LineRecordReader aware of the fact that the split ended in the middle
of a delimiter, so it can decide to read another record after the codec reports the split
ended.
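
As a rough illustration of the idea (not the attached patch itself), the sketch below models a
stream that reports the split as ended as soon as any byte of the next block is touched, even by
a peek after a CR. All class, method, and field names here are hypothetical, and the in-memory
byte array stands in for the compressed stream. With the flag ignored, the record following a
block that ends in CR is dropped; with the flag honored, the current split reads one more record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; none of these names come from HADOOP-9622.patch. */
final class SplitDelimiterSketch {

    /** Hypothetical line reader over an in-memory stand-in for a compressed stream. */
    static final class SketchLineReader {
        private final byte[] data;
        private final int splitEnd;       // index of the first byte belonging to the next split
        private int pos;
        private boolean splitEnded;       // models the codec reporting that the next block was touched
        private boolean needExtraRecord;  // set when only a peek after CR crossed the boundary

        SketchLineReader(byte[] data, int start, int splitEnd) {
            this.data = data;
            this.pos = start;
            this.splitEnd = splitEnd;
        }

        boolean splitEnded() { return splitEnded; }
        boolean needExtraRecord() { return needExtraRecord; }

        /** Every byte access goes through here so that peeks also trip the "split ended" report. */
        private byte access(int i) {
            if (i >= splitEnd) splitEnded = true;
            return data[i];
        }

        /** Returns the next line (CR, LF, or CRLF terminated), or null at end of data. */
        String readLine() {
            if (pos >= data.length) return null;
            needExtraRecord = false;
            int begin = pos;
            while (pos < data.length) {
                byte b = access(pos);
                if (b == '\r' || b == '\n') break;
                pos++;
            }
            int end = pos;
            if (pos < data.length) {
                boolean cr = data[pos] == '\r';
                pos++;                                   // consume the CR or LF
                if (cr && pos < data.length) {
                    boolean insideSplitBeforePeek = !splitEnded;
                    if (access(pos) == '\n') {
                        pos++;                           // CRLF: the LF belongs to this delimiter
                    } else if (insideSplitBeforePeek && splitEnded) {
                        needExtraRecord = true;          // the peek alone crossed into the next block
                    }
                }
            }
            return new String(data, begin, end - begin, StandardCharsets.UTF_8);
        }
    }

    /** Reads one split; applyFix controls whether the extra-record flag is honored. */
    static List<String> readSplit(byte[] data, int start, int end, boolean applyFix) {
        SketchLineReader in = new SketchLineReader(data, start, end);
        List<String> records = new ArrayList<>();
        if (start != 0) in.readLine();                   // non-first splits discard their first line
        while (!in.splitEnded() || (applyFix && in.needExtraRecord())) {
            String line = in.readLine();
            if (line == null) break;
            records.add(line);
        }
        return records;
    }

    /** Reads the data as two splits meeting at the given boundary and concatenates the records. */
    static List<String> readBothSplits(byte[] data, int boundary, boolean applyFix) {
        List<String> all = new ArrayList<>(readSplit(data, 0, boundary, applyFix));
        all.addAll(readSplit(data, boundary, data.length, applyFix));
        return all;
    }

    public static void main(String[] args) {
        byte[] endsInCr = "a\rb\rc\r".getBytes(StandardCharsets.UTF_8);
        System.out.println(readBothSplits(endsInCr, 2, false)); // prints [a, c]
        System.out.println(readBothSplits(endsInCr, 2, true));  // prints [a, b, c]
    }
}
```

When the byte after the CR is an LF, the flag stays false in this model: the next split discards
the degenerate record formed by the lone LF, so reading an extra record would duplicate it.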

I added some unit tests that use a couple of test files, which I'm also attaching.  These need
to be dropped into hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/resources/
so the unit tests can find them.

Any feedback is appreciated.  I'll also work on some tests with multi-byte custom delimiters
where the split ends in the middle of the delimiter.
                
> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>
>                 Key: HADOOP-9622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: blockEndingInCRThenLF.txt.bz2, blockEndingInCR.txt.bz2, HADOOP-9622.patch, HADOOP-9622-testcase.patch
>
>
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when reading them
> in splits based on where record delimiters occur relative to compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
