hadoop-common-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11445) Bzip2Codec: Data block is skipped when position of newly created stream is equal to start of split
Date Tue, 06 Jan 2015 21:32:34 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266813#comment-14266813
] 

Hudson commented on HADOOP-11445:
---------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #6816 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6816/])
HADOOP-11445. Bzip2Codec: Data block is skipped when position of newly created stream is equal
to start of split. Contributed by Ankit Kamboj (jlowe: rev d02fb53750bc592c23ba470ae82eb6f47d9a00ec)
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/BZip2Codec.java
* hadoop-common-project/hadoop-common/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java


> Bzip2Codec: Data block is skipped when position of newly created stream is equal to start of split
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11445
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11445
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Ankit Kamboj
>            Assignee: Ankit Kamboj
>            Priority: Critical
>             Fix For: 2.7.0
>
>         Attachments: HADOOP-11445.001.patch
>
>
> bz2 input files are handled by FileInputFormat and LineRecordReader. In LineRecordReader,
a bz2-specific compressed input stream is created to iterate over records. Each newly created
stream is positioned at the beginning of the next data block. The logic that finds the beginning
of the next block depends on the start of the split: the search begins 10 bytes before the start
of the split. If this first search yields an input stream whose position is before or at the
start of the split, the beginning of the next block is sought, on the assumption that the record
reader for the previous split has already iterated over the data block in which the current
split start lies. If the split start falls exactly on the byte where the newly created stream
is positioned (the start of a data block), an attempt is still made to find the beginning of
the next data block. This does not seem correct, because it skips a whole block and results
in missing records.
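
The boundary case described above can be illustrated with a small toy model. This is only a
sketch and not the BZip2Codec code: the class name, method names, block offsets, file length
and split size are all hypothetical, and the model assumes that a split's reader consumes
every block that begins before the split's end. It compares a "before or at" check with a
strictly "before" check; which comparison the committed patch uses is not stated in this message.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names come from Hadoop.
class SplitBlockSketch {
    // The description says the search starts 10 bytes before the split start.
    static final long LOOKBACK = 10;

    // Index of the first block whose start offset is >= pos,
    // or blockStarts.length if there is no such block.
    static int firstBlockAtOrAfter(long[] blockStarts, long pos) {
        for (int i = 0; i < blockStarts.length; i++) {
            if (blockStarts[i] >= pos) {
                return i;
            }
        }
        return blockStarts.length;
    }

    // Block indexes read by the reader of the split [start, end).
    // skipWhenEqual = true models the "before or at start of split" check;
    // skipWhenEqual = false models a strictly "before" check.
    static List<Integer> blocksForSplit(long[] blockStarts, long start, long end,
                                        boolean skipWhenEqual) {
        int i = firstBlockAtOrAfter(blockStarts, Math.max(0L, start - LOOKBACK));
        if (i < blockStarts.length) {
            long pos = blockStarts[i];
            boolean skip = skipWhenEqual ? pos <= start : pos < start;
            if (skip) {
                i++; // move on to the beginning of the next block
            }
        }
        List<Integer> read = new ArrayList<>();
        // Assumption: a reader consumes every block that begins before its split ends.
        while (i < blockStarts.length && blockStarts[i] < end) {
            read.add(i);
            i++;
        }
        return read;
    }

    // Concatenates the blocks read by every fixed-size split of the file.
    static List<Integer> blocksReadByAllSplits(long[] blockStarts, long fileLength,
                                               long splitSize, boolean skipWhenEqual) {
        List<Integer> all = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            all.addAll(blocksForSplit(blockStarts, start, end, skipWhenEqual));
        }
        return all;
    }

    public static void main(String[] args) {
        // Hypothetical layout: the block at offset 100 coincides with a split start.
        long[] blockStarts = {4, 100, 195, 320};
        System.out.println(blocksReadByAllSplits(blockStarts, 400, 100, true));
        System.out.println(blocksReadByAllSplits(blockStarts, 400, 100, false));
    }
}
```

With this hypothetical layout, the inclusive check never reads block 1 (the one that starts
exactly at split start 100), while the strict check reads all four blocks. The block at
offset 195 is still skipped by the split starting at 200 in both cases, which is harmless in
this model because the previous split's reader has already consumed it.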



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
