hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-14919) BZip2 drops records when reading data in splits
Date Mon, 30 Oct 2017 18:40:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225517#comment-16225517 ]

Jason Lowe commented on HADOOP-14919:

Had some more time to look into this today.  HADOOP-6852 is legit.  Concatenated bz2 files
don't work at all in Hadoop:
$ echo hey | bzip2 -c > foo.bz2
$ echo there | bzip2 -c >> foo.bz2
$ bzcat foo.bz2
$ hadoop fs -put foo.bz2
$ hadoop fs -text foo.bz2     
2017-10-30 13:18:08,083 INFO  [main] compress.CodecPool (CodecPool.java:getDecompressor(184))
- Got brand-new decompressor [.bz2]
text: bad block header

I don't think it would be very difficult to add support for concatenation.  IIUC all it needs
to do is account for the possibility that 'BZh9' could appear before the block marker.  We
should _not_ update the reported position when skipping just the 'BZh9' bytes, but only when
we move from block mark to block mark.  The existing behavior of skipping at the file offset
0 is benign, but I don't think we want/need to update reported position when skipping these
extra bytes mid-stream.
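
To make the layout concrete, here is a minimal Python sketch of what a concatenated file
looks like at the byte level.  It uses Python's standard bz2 module purely as a stand-in (it
is not Hadoop's decompressor), and the helper name decompress_concatenated is hypothetical.
The point is only that the second stream begins with its own 'BZh9' header immediately before
a block marker, and that a reader which restarts at that header recovers both records.

```python
import bz2

# 48-bit bzip2 block marker, as quoted in the issue description.
BLOCK_MARKER = bytes.fromhex("314159265359")


def decompress_concatenated(data):
    """Decompress back-to-back bz2 streams (hypothetical helper, not Hadoop code)."""
    out = []
    while data:
        decompressor = bz2.BZ2Decompressor()  # handles exactly one stream
        out.append(decompressor.decompress(data))
        if not decompressor.eof:
            raise ValueError("truncated bz2 stream")
        # Whatever follows the end of this stream starts with the next 'BZh9'.
        data = decompressor.unused_data
    return b"".join(out)


# Same shape as: echo hey | bzip2 -c > foo.bz2; echo there | bzip2 -c >> foo.bz2
first = bz2.compress(b"hey\n")      # default level 9, so the header is 'BZh9'
second = bz2.compress(b"there\n")
concatenated = first + second

# Bytes found mid-file at the concatenation point: stream header, then block marker.
mid_header = concatenated[len(first):len(first) + 4]
mid_marker = concatenated[len(first) + 4:len(first) + 10]

# A single-stream reader stops after the first stream and leaves the rest unread.
single = bz2.BZ2Decompressor()
first_only = single.decompress(concatenated)

print(mid_header, mid_marker == BLOCK_MARKER)
print(first_only, decompress_concatenated(concatenated))
```

A reader that only expects block markers after offset 0 sees 'BZh9' where it expects the
marker, which is consistent with the "bad block header" failure above.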

bq. The reader handling the previous block will detect the end of its stream, and a split
following it should find the block delimiter after the header of the next file. However, if
the text splits are around the concat point, the BZh9 bytes may not be accounted for.

Assuming we add the ability to silently skip 'BZh9' we should still be OK.  The compression
input stream will only report the position moving when the next block starts to be read. 
Whether we have 'BZh9' bytes or not doesn't change that.  We either read the whole block header
and marker or none of it.  The upper layer reader will continue reading until the reported
position changes, so the upper layer semantics don't change based on the presence of the extra
header bytes.  Therefore I argue we're either OK or already screwed whether there's an extra
header there or not.

bq. Would split0 pick up the ooooooo bytes?

I had a little trouble following the example and knowing what was a record delimiter.  In
general the split reader is responsible for reading until a record ends in the next split
because the next reader will always toss away the first record.  "Ends in the next split"
means the entire delimiter appears in the next split, since the next split reader will toss
all bytes up to and including the first record delimiter found.  There's some complicated
logic in LineRecordReader and SplitLineReader to account for buffering occurring at both the
codec and line reader levels along with the games codecs can play with reported byte position
in the stream.
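
As a rough illustration of that contract, the following Python sketch models two split
readers over uncompressed bytes.  It assumes a single-byte delimiter and no codec, so it
ignores all of the buffering and reported-position handling mentioned above; read_split is a
hypothetical helper and not the LineRecordReader logic.

```python
def read_split(data, start, end, delim=b"\n"):
    """Return the records owned by the split [start, end) (hypothetical helper)."""
    pos = start
    if start != 0:
        # Every reader but the first tosses all bytes up to and including
        # the first record delimiter it finds.
        idx = data.find(delim, pos)
        if idx == -1:
            return []
        pos = idx + len(delim)

    records = []
    # Keep reading while the next record starts at or before the split end,
    # so the last record read is one whose delimiter lies in the next split.
    while pos <= end and pos < len(data):
        idx = data.find(delim, pos)
        if idx == -1:
            records.append(data[pos:])  # final record without a delimiter
            break
        records.append(data[pos:idx])
        pos = idx + len(delim)
    return records


data = b"aa\nbbb\ncccc\ndd\n"
expected = [b"aa", b"bbb", b"cccc", b"dd"]

# Wherever the boundary falls, each record is read exactly once.
for split_at in range(1, len(data)):
    records = read_split(data, 0, split_at) + read_split(data, split_at, len(data))
    assert records == expected, split_at
```

This simple rule is not sufficient for a multi-byte delimiter that straddles the split
boundary, which is one reason the real readers are more involved.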

bq. It doesn't look like the unit tests cover a combination of multi-byte delimiters and splittable

See TestLineRecordReader#testBzipWithMultibyteDelimiter and compressedMultibyteDelimiter.txt.bz2.
I doubt it is exhaustive of all the corner cases, but there is at least one test there.

At this point I think we're good to go with committing this and addressing concatenated bz2
in HADOOP-6852.  As such I'll commit this tomorrow if there are no objections.

> BZip2 drops records when reading data in splits
> -----------------------------------------------
>                 Key: HADOOP-14919
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14919
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.8.0, 2.7.3, 3.0.0-alpha1
>            Reporter: Aki Tanaka
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: 250000.bz2, HADOOP-14919-test.patch, HADOOP-14919.001.patch
> BZip2 can drop records when reading data in splits. This problem was already discussed
> before in HADOOP-11445 and HADOOP-13270. But we still have a problem in corner case, causing
> lost data blocks.
> I attached a unit test for this issue. You can reproduce the problem if you run the unit
> First, this issue happens when position of newly created stream is equal to start of
> split. Hadoop has some test cases for this (blockEndingInCR.txt.bz2 file for TestLineRecordReader#testBzip2SplitStartAtBlockMarker,
> etc). However, the issue I am reporting does not happen when we run these tests because this
> issue happens only when the start of split byte block includes both block marker and compressed
> BZip2 block marker - 0x314159265359 (001100010100000101011001001001100101001101011001)
> blockEndingInCR.txt.bz2 (Start of Split - 136504):
> {code:java}
> $ xxd -l 6 -g 1 -b -seek 136498 ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/target/test-classes/blockEndingInCR.txt.bz2
> 0021532: 00110001 01000001 01011001 00100110 01010011 01011001  1AY&SY
> {code}
> Test bz2 File (Start of Split - 203426)
> {code:java}
> $ xxd -l 7 -g 1 -b -seek 203419 250000.bz2
> 0031a9b: 11100110 00101000 00101011 00100100 11001010 01101011  .(+$.k
> 0031aa1: 00101111                                               /
> {code}
> Let's say a job splits this test bz2 file into two splits at the start of split (position
> The former split does not read records which start position 203426 because BZip2 says
> the position of these dropped records is 203427. The latter split does not read the records
> because BZip2CompressionInputStream read the block from position 320955.
> Due to this behavior, records between 203427 and 320955 are lost.
> Also, if we reverted the changes in HADOOP-13270, we will not see this issue. We will
> see HADOOP-13270 issue though.
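
For reference, the two hex dumps in the description differ in bit alignment: the marker in
blockEndingInCR.txt.bz2 is byte-aligned, while the bytes dumped from 250000.bz2 contain the
same marker shifted by a few bits.  A small Python sketch can confirm this by scanning for
the 48-bit marker at every bit offset.  The helper find_block_markers is hypothetical and is
not Hadoop code; the input bytes are copied from the xxd output above.

```python
BLOCK_MARKER = 0x314159265359   # 48-bit bzip2 block marker from the description
MASK = (1 << 48) - 1


def find_block_markers(data):
    """Return the bit offsets at which the block marker starts (hypothetical helper)."""
    window = 0
    hits = []
    bits_seen = 0
    for byte in data:
        for bit in range(7, -1, -1):           # most significant bit first
            window = ((window << 1) | ((byte >> bit) & 1)) & MASK
            bits_seen += 1
            if bits_seen >= 48 and window == BLOCK_MARKER:
                hits.append(bits_seen - 48)    # offset of the marker's first bit
    return hits


# 6 bytes at offset 136498 of blockEndingInCR.txt.bz2 (from the first dump).
aligned = bytes.fromhex("314159265359")
# 7 bytes at offset 203419 of 250000.bz2 (from the second dump).
unaligned = bytes.fromhex("e6282b24ca6b2f")

aligned_hits = find_block_markers(aligned)
unaligned_hits = find_block_markers(unaligned)
print(aligned_hits, unaligned_hits)
```

In the second dump the marker starts 3 bits into byte 203419, so it ends partway through
byte 203425 and the remainder of that byte already belongs to the compressed block.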
