hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HADOOP-9622) bzip2 codec can drop records when reading data in splits
Date Thu, 06 Jun 2013 21:58:21 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jason Lowe reassigned HADOOP-9622:

    Assignee: Jason Lowe

This is a bit tricky to get right, as I ran into PIG-3352 while investigating it.  

However, I think we can detect these CR/LF/CRLF boundary conditions properly if the line reader
that is building the record reads the data byte by byte and notices the exact character at which
the reported position goes past the end of the split.  At that point it can decide which of
the cases it is in and react accordingly.  That would also solve the similar problems that exist
for custom, multi-byte delimiters that span block boundaries.
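
A minimal sketch of that idea, for illustration only.  It is not Hadoop's LineReader: the
class BoundaryAwareLineReader, the Crossing enum, and the demo helper are all hypothetical
names, and the "reported position" is simply assumed to be the number of bytes consumed so
far, which is not how a compressed stream reports its position.  It only shows how reading
one byte at a time lets the reader tell whether the split end was passed in the record body,
on the delimiter, or between the CR and LF of a CRLF pair.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch; not Hadoop's LineReader. */
final class BoundaryAwareLineReader {
    /** Where the reported position first exceeded the split end while reading one record. */
    enum Crossing { NONE, IN_BODY, AT_DELIMITER, INSIDE_CRLF }

    static final class LineResult {
        final String text;
        final Crossing crossing;
        LineResult(String text, Crossing crossing) {
            this.text = text;
            this.crossing = crossing;
        }
    }

    private final PushbackInputStream in;
    private final LongSupplier position; // assumption: stands in for the codec's reported position
    private final long splitEnd;

    BoundaryAwareLineReader(InputStream in, LongSupplier position, long splitEnd) {
        this.in = new PushbackInputStream(in, 1);
        this.position = position;
        this.splitEnd = splitEnd;
    }

    /** Reads one record byte by byte; returns null at end of stream. */
    LineResult readRecord() throws IOException {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        Crossing crossing = Crossing.NONE;
        boolean readAny = false;
        int b;
        while ((b = in.read()) != -1) {
            readAny = true;
            // Did reading this particular byte move the position past the split end?
            boolean crossedHere = crossing == Crossing.NONE && position.getAsLong() > splitEnd;
            if (b == '\n' || b == '\r') {
                if (crossedHere) crossing = Crossing.AT_DELIMITER;
                if (b == '\r') {
                    int next = in.read(); // look ahead for the LF of a CRLF pair
                    if (next == '\n') {
                        if (crossing == Crossing.NONE && position.getAsLong() > splitEnd) {
                            crossing = Crossing.INSIDE_CRLF;
                        }
                    } else if (next != -1) {
                        in.unread(next); // first byte of the next record
                    }
                }
                return new LineResult(new String(body.toByteArray(), StandardCharsets.UTF_8), crossing);
            }
            if (crossedHere) crossing = Crossing.IN_BODY;
            body.write(b);
        }
        return readAny
                ? new LineResult(new String(body.toByteArray(), StandardCharsets.UTF_8), crossing)
                : null;
    }

    /** Demo helper (assumption): the position is the count of bytes consumed so far. */
    static List<String> demo(String data, long splitEnd) {
        long[] consumed = {0};
        InputStream counting = new FilterInputStream(
                new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8))) {
            @Override public int read() throws IOException {
                int b = super.read();
                if (b != -1) consumed[0]++;
                return b;
            }
        };
        BoundaryAwareLineReader reader =
                new BoundaryAwareLineReader(counting, () -> consumed[0], splitEnd);
        List<String> out = new ArrayList<>();
        try {
            for (LineResult r; (r = reader.readRecord()) != null; ) {
                out.add(r.text + ":" + r.crossing);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

With the input "ab\r\ncd\nef" and a split end of 3, the first entry returned by demo is
"ab:INSIDE_CRLF"; with a split end of 2 it is "ab:AT_DELIMITER".  The crossing is reported
per record, so a caller would be expected to stop (or read exactly one more record,
depending on the case) after the first result that is not NONE.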

Currently the line reader is buffered, and it would be a shame to have to give that up.  I
think we can still use buffered reads from the codec stream with one critical assumption:
the codec will *never* return data spanning two blocks in a single read.  I'm assuming that's
the case today, since failure to do that would break the existing LineRecordReader->LineReader->SplittableCompressionCodec
relationship.  LineReader is buffering data from the codec, but LineRecordReader is
checking the codec's position after each record returned.
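
To illustrate why that assumption would let buffering survive, here is a rough sketch.  If
a single read never spans two blocks, every byte delivered by one buffer fill belongs to the
same block, so the position reported right after the fill can be attached to each byte in
that buffer.  Everything here is hypothetical: PositionTaggedBuffer is not a Hadoop class,
and the simulated stream in demo (which reports one made-up position per block) is an
assumption about the position model, not the real Bzip2Codec behaviour.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch: a buffer that remembers the position reported after each fill. */
final class PositionTaggedBuffer {
    private final InputStream in;
    private final LongSupplier position; // assumption: stands in for the codec's reported position
    private final byte[] buf;
    private int len = 0;
    private int off = 0;
    private long fillPosition = 0;

    PositionTaggedBuffer(InputStream in, LongSupplier position, int size) {
        if (size < 1) throw new IllegalArgumentException("size must be >= 1");
        this.in = in;
        this.position = position;
        this.buf = new byte[size];
    }

    /** Returns the next byte, or -1 at end of stream; refills with exactly one read call. */
    int next() throws IOException {
        if (off == len) {
            int n = in.read(buf, 0, buf.length);
            if (n == -1) return -1;
            len = n;
            off = 0;
            fillPosition = position.getAsLong(); // valid for every byte of this fill
        }
        return buf[off++] & 0xff;
    }

    /** Position reported after the fill that delivered the most recently returned byte. */
    long positionOfLastByte() {
        return fillPosition;
    }

    /** Demo helper (assumption): a stream whose reads never cross a block boundary. */
    static List<String> demo(String[] blocks, long[] blockPositions, int bufferSize) {
        byte[][] data = new byte[blocks.length][];
        for (int i = 0; i < blocks.length; i++) {
            data[i] = blocks[i].getBytes(StandardCharsets.UTF_8);
        }
        int[] cur = {0, 0};      // {block index, offset inside that block}
        long[] reported = {0};   // made-up position of the block last read from
        InputStream sim = new InputStream() {
            @Override public int read() {
                byte[] one = new byte[1];
                return read(one, 0, 1) == -1 ? -1 : one[0] & 0xff;
            }
            @Override public int read(byte[] dst, int dstOff, int n) {
                while (cur[0] < data.length && cur[1] == data[cur[0]].length) {
                    cur[0]++;    // current block exhausted, move to the next one
                    cur[1] = 0;
                }
                if (cur[0] == data.length) return -1;
                int k = Math.min(n, data[cur[0]].length - cur[1]); // never cross a block
                System.arraycopy(data[cur[0]], cur[1], dst, dstOff, k);
                cur[1] += k;
                reported[0] = blockPositions[cur[0]];
                return k;
            }
        };
        PositionTaggedBuffer b = new PositionTaggedBuffer(sim, () -> reported[0], bufferSize);
        List<String> out = new ArrayList<>();
        try {
            for (int c; (c = b.next()) != -1; ) {
                out.add((char) c + "@" + b.positionOfLastByte());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

For blocks {"abc", "de"} with made-up positions {10, 20}, demo returns a@10, b@10, c@10,
d@20, e@20 regardless of the buffer size.  If the stream were allowed to return bytes from
two blocks in one read, the per-fill position could no longer be attributed to individual
bytes, which is why the assumption is critical.
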
> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>                 Key: HADOOP-9622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: HADOOP-9622-testcase.patch
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when reading them
> in splits based on where record delimiters occur relative to compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.

