hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5656) bzip2 codec can drop records when reading data in splits
Date Mon, 02 Dec 2013 15:29:40 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836600#comment-13836600 ]

Jason Lowe commented on MAPREDUCE-5656:

That case is what the {{finished}} flag in CompressedSplitLineReader is intended to catch.
 Here's the scenario:

# LineRecordReader calls readLine
# The line processing causes us to fetch the next compressed block beyond the split (i.e.,
fillBuffer is called).  Let's say this causes us to set needAdditionalRecord=true.
# LineRecordReader will process another iteration of the loop and call readLine again
# readLine will notice that we are starting at a position past the end of the split and set
finished=true
# At that point the needAdditionalRecordAfterSplit method will always return false, so LineRecordReader
should read at most one record beyond the end of the split.

The key is that needAdditionalRecordAfterSplit() will always return false once readLine() has been
invoked at a position after the split ends.
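
To make the interaction concrete, here is a minimal, self-contained sketch of the guard. It is a simplified model and not the actual Hadoop code: the names Rec, SketchReader, readSplit and sampleRecords are hypothetical, stream positions are plain numbers supplied by the test data, and readLine returns null rather than a byte count when it has nothing more to return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the finished-flag guard.
// It does not use the real Hadoop classes; all names here are illustrative.
class SplitReaderSketch {

    /** One record plus what the modelled stream reports after reading it. */
    static final class Rec {
        final String text;
        final long posAfter;            // stream position reported once the record is read
        final boolean fetchedPastSplit; // true if reading it pulled in a block beyond the split

        Rec(String text, long posAfter, boolean fetchedPastSplit) {
            this.text = text;
            this.posAfter = posAfter;
            this.fetchedPastSplit = fetchedPastSplit;
        }
    }

    /** Stand-in for CompressedSplitLineReader. */
    static final class SketchReader {
        private final List<Rec> records;
        private final long splitEnd;
        private int next = 0;
        private long pos = 0;
        private boolean finished = false;
        private boolean needAdditionalRecord = false;

        SketchReader(List<Rec> records, long splitEnd) {
            this.records = records;
            this.splitEnd = splitEnd;
        }

        long getPos() {
            return pos;
        }

        String readLine() {
            if (finished || next >= records.size()) {
                return null; // nothing more to hand out for this split
            }
            // Step 4: we start past the end of the split, so this record is the last one.
            if (pos > splitEnd) {
                finished = true;
            }
            Rec r = records.get(next++);
            pos = r.posAfter;
            // Step 2: this read fetched the next compressed block beyond the split.
            if (r.fetchedPastSplit) {
                needAdditionalRecord = true;
            }
            return r.text;
        }

        boolean needAdditionalRecordAfterSplit() {
            // Step 5: always false once finished has been set.
            return !finished && needAdditionalRecord;
        }
    }

    /** Stand-in for the LineRecordReader loop (steps 1 and 3). */
    static List<String> readSplit(List<Rec> records, long splitEnd) {
        SketchReader in = new SketchReader(records, splitEnd);
        List<String> out = new ArrayList<>();
        while (in.getPos() <= splitEnd || in.needAdditionalRecordAfterSplit()) {
            String line = in.readLine();
            if (line == null) {
                break;
            }
            out.add(line);
        }
        return out;
    }

    /** Made-up data: reading "b" crosses into a block that lies beyond a split ending at 10. */
    static List<Rec> sampleRecords() {
        return Arrays.asList(
            new Rec("a", 5, false),
            new Rec("b", 12, true),
            new Rec("c", 12, false),
            new Rec("d", 20, false));
    }

    public static void main(String[] args) {
        System.out.println(readSplit(sampleRecords(), 10L)); // prints [a, b, c]
    }
}
```

With a split ending at position 10, the loop returns "a" and "b" normally, reads "c" as the single additional record, and then stops because needAdditionalRecordAfterSplit() returns false. Record "d" is left for the reader of the next split.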

> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>                 Key: MAPREDUCE-5656
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5656
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: HADOOP-9622-2.patch, HADOOP-9622-testcase.patch, HADOOP-9622.patch,
MAPREDUCE-5656.patch, blockEndingInCR.txt.bz2, blockEndingInCRThenLF.txt.bz2
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when reading them
in splits based on where record delimiters occur relative to compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.

