hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
Date Fri, 19 Jun 2015 18:16:02 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593729#comment-14593729
] 

Jason Lowe commented on MAPREDUCE-5948:
---------------------------------------

+1 for the latest patch.  This should resolve the dropped and duplicated record problems with
uncompressed input.  We can tackle the reported duplicate records for bz2 in MAPREDUCE-6299.

Will commit this early next week if there are no objections.

> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5948
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5948
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.23.9, 2.2.0
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>            Assignee: Akira AJISAKA
>            Priority: Critical
>         Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch,
MAPREDUCE-5948.002.patch, MAPREDUCE-5948.003.patch
>
>
> Defining a multi-byte record delimiter in a new InputFileFormat sometimes causes records
to be skipped from the input.
> This happens when an input split boundary falls immediately after a record delimiter. The
start offset of the next split is then non-zero, so skipFirstLine is true. The reader seeks
to start - 1 and ignores the text up to the first record delimiter (on the presumption that
this record was already handled by the previous map task). Because the record delimiter is
multi-byte, the seek brings only the last byte of the delimiter into scope, and it is not
recognized as a full delimiter. The text is therefore skipped up to the next delimiter
(ignoring a full record!).
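
The boundary case described above can be reproduced with a small simulation. The sketch
below is not Hadoop code: read_split and its "back" parameter are hypothetical names, and
it only mimics the seek-to-(start - 1)-then-skip behaviour from the description. Seeking
back by the full delimiter length is shown purely as a contrast for this one case; it is
not claimed to be what the attached patches do.

```python
def read_split(data: bytes, start: int, end: int, delim: bytes, back: int) -> list:
    """Return the records one map task would emit for the split [start, end).

    `back` is how many bytes the reader seeks back before discarding the
    first (presumed already handled) record. The behaviour described in the
    issue corresponds to back=1.
    """
    pos = start
    if start != 0:
        # Seek back and skip everything up to and including the first delimiter.
        pos = start - back
        idx = data.find(delim, pos)
        if idx == -1:
            return []
        pos = idx + len(delim)

    records = []
    # A task reads every record that begins before the end of its split.
    while pos < end and pos < len(data):
        idx = data.find(delim, pos)
        if idx == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:idx])
        pos = idx + len(delim)
    return records


data = b"rec1||rec2||rec3||rec4||"
delim = b"||"
boundary = 6  # the split boundary falls just after the first "||"

first = read_split(data, 0, boundary, delim, back=1)
second = read_split(data, boundary, len(data), delim, back=1)
print(first + second)  # rec2 is missing: only rec1, rec3 and rec4 are emitted

# Contrast: seeking back by the whole delimiter keeps rec2 in this case.
print(first + read_split(data, boundary, len(data), delim, back=len(delim)))
```

With back=1 the second task starts at offset 5, sees only the trailing "|" of the
delimiter, and therefore treats "|rec2" as the tail of the previous record. With a
single-byte delimiter such as "\n" the same seek lands exactly on the delimiter, which is
why the problem only shows up for multi-byte delimiters.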



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)