hadoop-common-dev mailing list archives

From "Kris Geusebroek (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HADOOP-9867) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
Date Tue, 13 Aug 2013 12:33:48 GMT
Kris Geusebroek created HADOOP-9867:
---------------------------------------

             Summary: org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
                 Key: HADOOP-9867
                 URL: https://issues.apache.org/jira/browse/HADOOP-9867
             Project: Hadoop Common
          Issue Type: Bug
          Components: io
    Affects Versions: 0.20.2
         Environment: CDH3U2 Redhat linux 5.7
            Reporter: Kris Geusebroek


Defining a record delimiter of multiple bytes in a new InputFileFormat sometimes has
the effect of skipping records from the input.

This happens when an input split boundary falls just after a record delimiter. The starting
point of the next split is then non-zero, so skipFirstLine is true. The reader seeks to
start - 1 and ignores the text up to the first record delimiter (on the presumption that
this record has already been handled by the previous map task). Because the record delimiter
is multibyte, the seek brings only the last byte of the delimiter into scope, and that byte is
not recognized as a full delimiter. The text is therefore skipped until the next delimiter,
which drops a full record.
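
The behaviour described above can be reproduced with a small standalone sketch. This is
not the Hadoop source: the class SplitSkipDemo, its methods, the sample data and the
"backup" parameter are all hypothetical names introduced for illustration. It assumes a
reader that, for a non-first split, seeks to start - backup and discards everything up to
and including the first delimiter it finds. With backup = 1 and a two-byte delimiter, the
record at the start of the second split is lost; backing up by the delimiter length is
shown only as one possible way to avoid the loss, not as the fix adopted by the project.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of split handling; not the real LineRecordReader.
class SplitSkipDemo {

    // Index of the first full occurrence of delim at or after 'from', or -1 if none.
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < delim.length; j++) {
                if (data[i + j] != delim[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // Reads the records belonging to split [start, end).
    // A non-first split seeks to start - backup and throws away the text up to
    // and including the first delimiter, assuming the previous split handled it.
    static List<String> readSplit(byte[] data, byte[] delim, int start, int end, int backup) {
        int pos = start;
        if (start != 0) {
            int d = indexOf(data, delim, start - backup);
            pos = (d < 0) ? data.length : d + delim.length;
        }
        List<String> records = new ArrayList<>();
        // A record belongs to this split if it begins before 'end'.
        while (pos < end && pos < data.length) {
            int d = indexOf(data, delim, pos);
            int recordEnd = (d < 0) ? data.length : d;
            records.add(new String(data, pos, recordEnd - pos, StandardCharsets.UTF_8));
            pos = (d < 0) ? data.length : d + delim.length;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa||bb||cc".getBytes(StandardCharsets.UTF_8);
        byte[] delim = "||".getBytes(StandardCharsets.UTF_8);
        // The split boundary at byte 4 falls just after the first "||".
        System.out.println(readSplit(data, delim, 0, 4, 1));             // prints [aa]
        System.out.println(readSplit(data, delim, 4, data.length, 1));   // prints [cc]; "bb" is lost
        System.out.println(readSplit(data, delim, 4, data.length, 2));   // prints [bb, cc]
    }
}
```

In the second call the seek lands on byte 3, the last byte of the first "||", so the
first full delimiter found is the one after "bb" and that record is discarded.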


