pig-dev mailing list archives

From "Adam Szita (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PIG-5373) InterRecordReader might skip records if certain sync markers are used
Date Thu, 20 Dec 2018 13:24:00 GMT

    [ https://issues.apache.org/jira/browse/PIG-5373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725832#comment-16725832 ]

Adam Szita commented on PIG-5373:

Attached [^PIG-5373.0.patch], which corrects the reading of sync markers by using a FIFO and comparing the
FIFO content with the expected marker.

A test case is attached that verifies, in a brute-force way, that such prefix scenarios are handled.

[~nkollar], [~rohini] can you take a look please?
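
For readers following along without the patch, here is a minimal sketch of the FIFO idea and of a brute-force check in the same spirit. It is not the code from PIG-5373.0.patch: the class FifoMarkerScan, its method names, the small three-byte alphabet, and the length limits are all assumptions made for illustration, and the marker is assumed to be non-empty.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FifoMarkerScan {
    /** Reads until the last marker.length bytes equal the marker; true if it was found. */
    static boolean skipToMarker(InputStream in, byte[] marker) throws IOException {
        ArrayDeque<Byte> fifo = new ArrayDeque<>(marker.length);
        int b;
        while ((b = in.read()) != -1) {
            if (fifo.size() == marker.length) {
                fifo.removeFirst();               // drop the oldest byte
            }
            fifo.addLast((byte) b);               // append the newest byte
            if (fifo.size() == marker.length && matches(fifo, marker)) {
                return true;
            }
        }
        return false;                             // EOF reached without a match
    }

    /** Compares the FIFO content (oldest to newest) with the marker. */
    static boolean matches(ArrayDeque<Byte> fifo, byte[] marker) {
        int i = 0;
        for (byte b : fifo) {
            if (b != marker[i++]) return false;
        }
        return true;
    }

    /** Convenience wrapper over an in-memory buffer. */
    static boolean findsMarker(byte[] data, byte[] marker) {
        try {
            return skipToMarker(new ByteArrayInputStream(data), marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reference answer: does the marker occur anywhere in data? */
    static boolean contains(byte[] data, byte[] marker) {
        for (int start = 0; start + marker.length <= data.length; start++) {
            if (Arrays.equals(Arrays.copyOfRange(data, start, start + marker.length), marker)) {
                return true;
            }
        }
        return false;
    }

    /** All byte sequences of the given length over the alphabet. */
    static List<byte[]> allSequences(byte[] alphabet, int len) {
        int total = 1;
        for (int i = 0; i < len; i++) total *= alphabet.length;
        List<byte[]> out = new ArrayList<>();
        for (int n = 0; n < total; n++) {
            byte[] seq = new byte[len];
            int rest = n;
            for (int i = 0; i < len; i++) {
                seq[i] = alphabet[rest % alphabet.length];
                rest /= alphabet.length;
            }
            out.add(seq);
        }
        return out;
    }

    /** Counts marker/data pairs where the FIFO scan disagrees with the reference. */
    static int countMismatches(byte[] alphabet, int markerLen, int maxDataLen) {
        int mismatches = 0;
        for (byte[] marker : allSequences(alphabet, markerLen)) {
            for (int len = 0; len <= maxDataLen; len++) {
                for (byte[] data : allSequences(alphabet, len)) {
                    if (findsMarker(data, marker) != contains(data, marker)) mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        byte[] marker = {-128, -128, 4};
        byte[] data = {127, -1, 2, -128, -128, -128, 4, 1, 2, 3};
        System.out.println("marker found: " + findsMarker(data, marker));
        System.out.println("mismatches: " + countMismatches(new byte[] {-128, 4, 1}, 3, 7));
    }
}
```

Because the FIFO always holds the most recent marker-length bytes, a marker that begins inside a partially matched prefix is still seen. A small alphabet keeps the enumeration cheap while still producing many overlapping-prefix cases.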

> InterRecordReader might skip records if certain sync markers are used
> ---------------------------------------------------------------------
>                 Key: PIG-5373
>                 URL: https://issues.apache.org/jira/browse/PIG-5373
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.17.0
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: PIG-5373.0.patch
> Due to a bug in InterRecordReader#skipUntilMarkerOrSplitEndOrEOF(), it can happen that
> sync markers are not identified while reading the interim binary file used to hold data between
> In such files, sync markers are placed upon writing, which later help during reading the
> data. These are randomly generated, and it seems that for some rare combinations of a marker
> and the data preceding it, the marker cannot be found. This can result in reading through all
> the bytes (looking for the marker), reaching the split end or EOF, and extracting no records
> at all.
> This symptom is also observable in the JobHistory stats: a job affected by this issue
> will have tasks whose HDFS_BYTES_READ or FILE_BYTES_READ is about equal to the number of
> bytes in the split, while at the same time having MAP_INPUT_RECORDS=0.
> One such (test) example is this:
> {code:java}
> marker: [-128, -128, 4] , data: [127, -1, 2, -128, -128, -128, 4, 1, 2, 3]{code}
> Due to a bug, markers whose prefix overlaps with the last data chunk are not seen
> by the reader.
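
To make the failure mode concrete, the sketch below runs a deliberately naive scanner over the marker/data pair quoted above. It illustrates this class of bug and is not the actual InterRecordReader code: the class NaiveMarkerScan and its methods are hypothetical. The scanner restarts matching after a mismatch without going back over the bytes it already consumed as a partial match.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class NaiveMarkerScan {
    /** Hypothetical naive scan: true only if the whole marker is matched in one run. */
    static boolean skipToMarker(InputStream in, byte[] marker) throws IOException {
        int b = in.read();
        while (b != -1) {
            if ((byte) b != marker[0]) {
                b = in.read();                    // keep looking for the first marker byte
                continue;
            }
            int i = 1;
            while (i < marker.length) {
                b = in.read();
                if (b == -1) return false;        // EOF in the middle of a candidate
                if ((byte) b != marker[i]) break; // mismatch: earlier matched bytes are lost
                i++;
            }
            if (i == marker.length) return true;
            // b still holds the mismatching byte and is re-checked against marker[0] only
        }
        return false;
    }

    /** Convenience wrapper over an in-memory buffer. */
    static boolean findsMarker(byte[] data, byte[] marker) {
        try {
            return skipToMarker(new ByteArrayInputStream(data), marker);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] marker = {-128, -128, 4};
        byte[] data = {127, -1, 2, -128, -128, -128, 4, 1, 2, 3};
        // The marker starts at index 4, but the scan commits to the candidate at index 3.
        System.out.println("marker found: " + findsMarker(data, marker));
    }
}
```

On this input the scan matches -128, -128 starting at index 3, fails on the third -128, and then treats that byte as a fresh start, so the real marker at indexes 4 to 6 is never recognized and the method returns false.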

This message was sent by Atlassian JIRA