hadoop-mapreduce-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6481) LineRecordReader may give incomplete record and wrong position/key information for uncompressed input sometimes.
Date Fri, 18 Sep 2015 07:26:04 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14805125#comment-14805125 ]

zhihai xu commented on MAPREDUCE-6481:
--------------------------------------

[~jlowe], thanks for reviewing and committing the patch! This patch depends on MAPREDUCE-5948:
it applies cleanly once MAPREDUCE-5948 has been applied. Shall we add both MAPREDUCE-5948
and MAPREDUCE-6481 to the 2.7.2 release?

> LineRecordReader may give incomplete record and wrong position/key information for uncompressed
input sometimes.
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6481
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6481
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: MAPREDUCE-6481.000.patch
>
>
> LineRecordReader may give incomplete record and wrong position/key information for uncompressed
input sometimes.
> There are two issues:
> # LineRecordReader may give incomplete record: some characters cut off at the end of
record.
> # LineRecordReader may give wrong position/key information.
> The first issue only happens with a custom delimiter. It is caused by the following code
in {{LineReader#readCustomLine}}:
> {code}
>     if (appendLength > 0) {
>       if (ambiguousByteCount > 0) {
>         str.append(recordDelimiterBytes, 0, ambiguousByteCount);
>         //appending the ambiguous characters (refer case 2.2)
>         bytesConsumed += ambiguousByteCount;
>         ambiguousByteCount=0;
>       }
>       str.append(buffer, startPosn, appendLength);
>       txtLength += appendLength;
>     }
> {code}
> The bug is triggered when {{appendLength}} is 0 and {{ambiguousByteCount}} is not 0: the
pending ambiguous bytes are then never appended to the record. For example, if the input is
"123456789aab", the custom delimiter is "ab", bufferSize is 10 and splitLength is 12, the
correct record is "123456789a" with length 10, but the current code returns the incomplete
record "123456789" with length 9.
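The case above can be modelled in isolation. The following is a minimal sketch, not the
committed patch: the class name, the method names and the use of StringBuilder in place of
Hadoop's Text are assumptions made for illustration. It replays only the append step for the
second buffer fill of the example, where "123456789" has already been appended, one ambiguous
byte ("a") is pending, and {{appendLength}} is 0. The second method shows one possible way to
flush the pending bytes independently of {{appendLength}}.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, simplified model of the append step; not Hadoop code.
class AmbiguousFlushSketch {

    // Mirrors the quoted snippet: pending ambiguous bytes are flushed
    // only when appendLength > 0.
    static String appendAsQuoted(String soFar, byte[] delimiter, int ambiguousByteCount,
                                 byte[] buffer, int startPosn, int appendLength) {
        StringBuilder str = new StringBuilder(soFar);
        if (appendLength > 0) {
            if (ambiguousByteCount > 0) {
                str.append(new String(delimiter, 0, ambiguousByteCount, StandardCharsets.UTF_8));
            }
            str.append(new String(buffer, startPosn, appendLength, StandardCharsets.UTF_8));
        }
        return str.toString();
    }

    // Assumed alternative: flush pending ambiguous bytes whenever appendLength
    // is not negative, then append the buffer contents as before.
    static String appendWithSeparateFlush(String soFar, byte[] delimiter, int ambiguousByteCount,
                                          byte[] buffer, int startPosn, int appendLength) {
        StringBuilder str = new StringBuilder(soFar);
        if (appendLength >= 0 && ambiguousByteCount > 0) {
            str.append(new String(delimiter, 0, ambiguousByteCount, StandardCharsets.UTF_8));
        }
        if (appendLength > 0) {
            str.append(new String(buffer, startPosn, appendLength, StandardCharsets.UTF_8));
        }
        return str.toString();
    }

    public static void main(String[] args) {
        byte[] delimiter = "ab".getBytes(StandardCharsets.UTF_8);
        byte[] secondBuffer = "ab".getBytes(StandardCharsets.UTF_8);
        // State after the first buffer "123456789a": one ambiguous byte pending.
        System.out.println(appendAsQuoted("123456789", delimiter, 1, secondBuffer, 0, 0));
        // prints 123456789
        System.out.println(appendWithSeparateFlush("123456789", delimiter, 1, secondBuffer, 0, 0));
        // prints 123456789a
    }
}
```

The sketch omits the {{bytesConsumed}} bookkeeping and the delimiter scan itself; it only
shows why nesting the flush inside the {{appendLength > 0}} check drops the trailing "a".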
> The second issue can happen with both a custom delimiter and the default delimiter. It is
caused by the code in {{UncompressedSplitLineReader#readLine}}, which may report the wrong
size in some corner cases. The reason is the way {{unusedBytes}} is computed in the
following code:
> {code}
> bytesRead += unusedBytes;
> unusedBytes = bufferSize - getBufferPosn();
> bytesRead -= unusedBytes;
> {code}
> If the number of bytes returned by the last read ({{bufferLength}}) is less than bufferSize,
the previous {{unusedBytes}} is wrong: it should be {{bufferLength}} - {{bufferPosn}} instead
of bufferSize - {{bufferPosn}}. The current computation yields a value that is too large.
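The size of the error can be worked out directly. The sketch below is illustrative only: the
class and method names are hypothetical and it is not the Hadoop code. It compares the two
formulas named above for a short read.

```java
// Hypothetical arithmetic sketch; not Hadoop code.
class UnusedBytesSketch {

    // Formula described above as the current behaviour.
    static int unusedFromBufferSize(int bufferSize, int bufferPosn) {
        return bufferSize - bufferPosn;
    }

    // Formula described above as the correct one.
    static int unusedFromBufferLength(int bufferLength, int bufferPosn) {
        return bufferLength - bufferPosn;
    }

    public static void main(String[] args) {
        int bufferSize = 10;   // capacity of the read buffer
        int bufferLength = 5;  // bytes actually returned by a short read
        int bufferPosn = 5;    // all returned bytes already consumed
        int current = unusedFromBufferSize(bufferSize, bufferPosn);
        int expected = unusedFromBufferLength(bufferLength, bufferPosn);
        System.out.println(current - expected); // prints 5
    }
}
```

Whatever the value of {{bufferPosn}}, the two results differ by bufferSize - {{bufferLength}},
so the error equals the number of bytes by which the read fell short.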
> For example, suppose the input is "1234567890ab12ab345", the custom delimiter is "ab",
bufferSize is 10, and there are two splits: the first splitLength is 15 and the second is 4.
> The current code gives the following result:
> First record: Key:0 Value:"1234567890"
> Second record: Key:12 Value:"12"
> Third record: Key:21 Value:"345"
> The key for the third record is wrong: it should be 16 instead of 21. This is due to the
wrong {{unusedBytes}}. {{fillBuffer}} read 10 bytes the first time, but the second time it
read only 5 bytes, which is 5 bytes less than bufferSize. That is why the key we get is
5 bytes larger than the correct one.
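The expected keys in this example are the byte offsets at which each record starts. As a
cross-check, the sketch below computes them with plain string scanning. The class and method
names are hypothetical, splits are ignored, and it assumes single-byte characters so that
character offsets equal byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cross-check of record start offsets; not Hadoop code.
class RecordKeySketch {

    // Returns the start offset of every record in input, where records are
    // separated by delimiter. Assumes single-byte characters.
    static List<Integer> recordKeys(String input, String delimiter) {
        List<Integer> keys = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            keys.add(pos);
            int next = input.indexOf(delimiter, pos);
            if (next < 0) {
                break; // last record has no trailing delimiter
            }
            pos = next + delimiter.length();
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(recordKeys("1234567890ab12ab345", "ab")); // prints [0, 12, 16]
    }
}
```

The third offset is 16, and 21 - 16 = 5 matches the 5 bytes by which the second
{{fillBuffer}} call fell short of bufferSize.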



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
