hadoop-common-issues mailing list archives

From "Nathan Roberts (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits
Date Wed, 12 Jun 2013 21:38:20 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13681668#comment-13681668 ]

Nathan Roberts commented on HADOOP-9622:
----------------------------------------

I looked over the patch and the approach seems reasonable. I'm not sure there is much else you
can do without changing the codecs themselves or forcing everything to move one byte at a time
(so that we can know precisely when the codec moves beyond the split).

It might help to add some comments in the code that specifically illustrate the
boundary conditions. As you said, it's tricky (clearly, since both Pig and MR got it wrong in
slightly different edge cases), so it certainly couldn't hurt to add more commentary in
this area.
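
To make "boundary conditions" concrete, here is a minimal sketch of the usual ownership rule for
newline-delimited records read in byte-range splits. It is a simplified model on uncompressed
bytes, not the bzip2 codec and not the code in the patch; read_split and read_all_in_splits are
hypothetical names used only for illustration.

```python
from typing import List


def read_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the records owned by the split [start, end) of newline-delimited data."""
    pos = start
    if start != 0:
        # Not the first split: the previous reader owns the line that contains
        # or begins at `start`, so discard everything through the next newline.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # Note `<=`: a line that begins exactly at `end` still belongs to this
    # split, because the next split unconditionally discards its first line.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records


def read_all_in_splits(data: bytes, split_size: int) -> List[bytes]:
    """Read `data` as consecutive splits of `split_size` bytes and concatenate the records."""
    out = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        out.extend(read_split(data, start, end))
    return out


if __name__ == "__main__":
    sample = b"aa\nbb\ncc\n"
    # Split size 3 puts every split boundary exactly on a line start.
    print(read_all_in_splits(sample, 3))
```

In this model, changing `<=` to `<` drops any line that begins exactly at a split boundary: the
earlier split stops before it and the later split discards it. The rule only works if the reader
knows its position exactly, which is the difficulty with a codec that moves a block at a time.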
                
> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>
>                 Key: HADOOP-9622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: blockEndingInCRThenLF.txt.bz2, blockEndingInCR.txt.bz2, HADOOP-9622.patch,
> HADOOP-9622-testcase.patch
>
>
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when reading them
> in splits based on where record delimiters occur relative to compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
