hadoop-common-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-9622) bzip2 codec can drop records when reading data in splits
Date Mon, 18 Nov 2013 22:25:21 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825854#comment-13825854 ]

Jason Lowe commented on HADOOP-9622:
------------------------------------

bq. {{inDelimiter}} is insufficient because {{LineReader::readDefaultLine}} will match \n,
while {{LineReader::readCustomLine}} would consider a partial match incomplete and require
an extra line?

Yes, the crux of the issue is that the default delimiter handling accepts a subset of the full
delimiter as a valid delimiter (i.e., \r\n is a delimiter, but so is \r or \n on its own).  The
custom delimiter support does not allow a subset of the specified delimiter to be a valid
delimiter, so a reader that starts partway through a delimiter won't recognize the remaining
characters as a delimiter and will read an extra line before starting.
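
To make the difference concrete, here is a small Python model of the two matching rules. It is only a sketch of the behavior described above, not the actual {{LineReader}} code; the function names and the sample bytes are made up for illustration.

```python
def first_record_end_default(data: bytes, start: int) -> int:
    """Offset just past the first terminator at or after `start`,
    treating \\r, \\n and \\r\\n all as valid terminators."""
    i = start
    while i < len(data):
        if data[i] == 0x0A:  # lone \n
            return i + 1
        if data[i] == 0x0D:  # \r, optionally followed by \n
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                return i + 2
            return i + 1
        i += 1
    return len(data)


def first_record_end_custom(data: bytes, start: int, delim: bytes) -> int:
    """Offset just past the first FULL occurrence of `delim` at or after
    `start`; a partial delimiter at `start` is not recognized."""
    idx = data.find(delim, start)
    return len(data) if idx < 0 else idx + len(delim)


sample = b"aa\r\nbb\r\ncc\r\n"
mid_delimiter = 3  # offset of the \n inside the first \r\n

# Default rule: the lone \n counts, so only the delimiter tail is skipped.
default_end = first_record_end_default(sample, mid_delimiter)
# Custom rule with the same bytes as delimiter: "bb" is skipped as well.
custom_end = first_record_end_custom(sample, mid_delimiter, b"\r\n")
print(default_end, custom_end)  # prints: 4 8
```

Starting at the same offset, the default rule resumes at "bb" while the custom rule resumes at "cc", which is the extra line mentioned above.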

bq. I looked briefly at the custom delimiter code, and I'm not seeing how it handles splits
that start in the middle of a delimiter. I must be missing something obvious...

Yeah, it does look like there's a problem with the handling of custom record delimiters on
uncompressed input.  For this to work properly, the consumer of the previous split needs to
handle all bytes up to and including the first full record delimiter that starts at or after
the end of its split.  With this patch I think we have this case covered for compressed input
due to the {{needAdditionalRecordAfterSplit}} logic.  However, since the custom delimiter line
reader seems to be returning the size of the record plus the subsequent delimiter bytes as the
bytes consumed, I think we will end up reporting the end of the split too early to the
{{LineRecordReader}} for uncompressed data when the delimiter straddles the split boundary.
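
The rule for the previous split's consumer can be written down as a tiny Python helper. This is a sketch of the rule as stated above, not code from the patch; {{previous_split_read_limit}} and the sample bytes are hypothetical names and data chosen for illustration.

```python
def previous_split_read_limit(data: bytes, split_end: int, delim: bytes) -> int:
    """Offset just past the first full delimiter that STARTS at or after
    `split_end`. The consumer of the split ending at `split_end` should
    handle every byte before this offset."""
    idx = data.find(delim, split_end)
    return len(data) if idx < 0 else idx + len(delim)


# Layout: r1 at 0-1, xxx at 2-4, r2 at 5-6, xxx at 7-9, r3 at 10-11, xxx at 12-14
sample = b"r1xxxr2xxxr3xxx"

# Delimiter starts exactly at the split end: it still belongs to the
# previous split, which stops right after it.
limit_at_delim_start = previous_split_read_limit(sample, 2, b"xxx")

# Split end falls on the last byte of a delimiter (the straddling case):
# the first full delimiter starting at or after offset 4 is the one after
# "r2", so the previous split must also consume "r2".
limit_straddling = previous_split_read_limit(sample, 4, b"xxx")

print(limit_at_delim_start, limit_straddling)  # prints: 5 10
```

In the straddling case the next split's reader cannot tell that it started inside a delimiter, so it throws away everything through the delimiter after "r2"; the record is only preserved if the previous split reads that far.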

To verify there's a problem, I ran a simple wordcount on the following input data:

{noformat}

abcxxx
defxxx
ghixxx
jklxxx
mnoxxx
pqrxxx
stuxxx
vw xxx
xyzxxx
{noformat}

and then I ran it with the options {{-Dmapreduce.input.fileinputformat.split.maxsize=34 -Dtextinputformat.record.delimiter=xxx}}.
The resulting output looked like this:

{noformat}
abc	1
def	1
ghi	1
jkl	1
mno	1
stu	1
vw	1
xyz	1
{noformat}

So we dropped the "pqr" record.  Not good.
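
The drop can be reproduced outside Hadoop with a simplified Python model of the split readers. Several things here are assumptions rather than facts from the run above: the file is taken to be 64 bytes (a leading blank line, then one line per record, each ending in a newline), the splits are taken to be bytes 0-33 and 34-63, and {{read_split}} is a hypothetical stand-in for the reader logic that stops as soon as the bytes consumed pass the split end.

```python
DELIM = b"xxx"
WORDS = [b"abc", b"def", b"ghi", b"jkl", b"mno", b"pqr", b"stu", b"vw ", b"xyz"]
# Assumed byte layout: leading blank line, then "<word>xxx\n" per record.
DATA = b"\n" + b"".join(w + DELIM + b"\n" for w in WORDS)


def next_record(data, pos, delim):
    """Return (record, new_pos); new_pos counts record plus delimiter bytes."""
    idx = data.find(delim, pos)
    if idx < 0:
        return data[pos:], len(data)
    return data[pos:idx], idx + len(delim)


def read_split(data, start, end, delim):
    """Simplified reader: a non-first split discards everything through the
    first full delimiter it sees, then reads records while pos <= end."""
    pos = start
    if start != 0:
        _, pos = next_record(data, pos, delim)
    records = []
    while pos <= end and pos < len(data):
        record, pos = next_record(data, pos, delim)
        records.append(record)
    return records


SPLIT = 34  # mapreduce.input.fileinputformat.split.maxsize=34
records = read_split(DATA, 0, SPLIT, DELIM) + read_split(DATA, SPLIT, len(DATA), DELIM)
# Tokenize on whitespace, as wordcount does.
words = [w.decode() for r in records for w in r.split()]
print(words)
# prints: ['abc', 'def', 'ghi', 'jkl', 'mno', 'stu', 'vw', 'xyz']
```

In this model the delimiter after "mno" occupies bytes 32-34, so the first reader finishes at position 35 and stops, while the second reader starts on the last "x" at byte 34 and discards through the delimiter after "pqr".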

I'm tempted to handle this as a separate JIRA since I believe this will be an issue only with
uncompressed inputs after this patch.

> bzip2 codec can drop records when reading data in splits
> --------------------------------------------------------
>
>                 Key: HADOOP-9622
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9622
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: HADOOP-9622-2.patch, HADOOP-9622-testcase.patch, HADOOP-9622.patch, blockEndingInCR.txt.bz2, blockEndingInCRThenLF.txt.bz2
>
>
> Bzip2Codec.BZip2CompressionInputStream can cause records to be dropped when reading them in splits based on where record delimiters occur relative to compression block boundaries.
> Thanks to [~knoguchi] for discovering this problem while working on PIG-3251.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
