hadoop-mapreduce-issues mailing list archives

From "Dustin Cote (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
Date Sat, 14 Nov 2015 21:39:11 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dustin Cote updated MAPREDUCE-6549:
-----------------------------------
    Attachment: MAPREDUCE-6549-1.patch

Attaching a patch that removes the attempt to read the last incomplete record of an input
and changes the tests to cover a more generic, imperfect scenario.  I'll add more tests if
review deems it necessary.  As far as I am aware, we should drop an incomplete record at the
end of the input.  With this patch that now happens, and the correct number of records comes
up in the middle of the input (where previously there were duplicates).

> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6549
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.7.2
>            Reporter: Dustin Cote
>            Assignee: Dustin Cote
>         Attachments: MAPREDUCE-6549-1.patch
>
>
> LineRecordReader currently produces duplicate records under certain scenarios, such as:
> 1) input string: "abc+++def++ghi++"
> delimiter string: "+++"
> test passes with all sizes of the split
> 2) input string: "abc++def+++ghi++"
> delimiter string: "+++"
> test fails with a split size of 4
> 3) input string: "abc+++def++ghi++"
> delimiter string: "++"
> test fails with a split size of 5
> 4) input string: "abc+++defg++hij++"
> delimiter string: "++"
> test fails with a split size of 4
> 5) input string: "abc++def+++ghi++"
> delimiter string: "++"
> test fails with a split size of 9
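
For reference when checking the scenarios above, here is a minimal Java sketch of the
record list one would expect regardless of split size, assuming leftmost-first delimiter
matching and assuming (per the comment above) that an unterminated record at the end of
the input is dropped. The class and method names are hypothetical. This is not code from
the patch and it does not use LineRecordReader; it only scans the whole string in memory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: computes the records a
// split-independent reader would be expected to return.
public class DelimiterSketch {

    // Returns only the records terminated by the delimiter. Any trailing
    // text without a terminating delimiter is treated as incomplete and dropped
    // (an assumption taken from the patch description, not from Hadoop docs).
    static List<String> completeRecords(String input, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        while (true) {
            // Leftmost match of the full multi-character delimiter.
            int end = input.indexOf(delimiter, start);
            if (end < 0) {
                break; // the remainder is an incomplete record
            }
            records.add(input.substring(start, end));
            start = end + delimiter.length();
        }
        return records;
    }

    public static void main(String[] args) {
        // Scenario with delimiter "+++": one complete record.
        System.out.println(completeRecords("abc++def+++ghi++", "+++"));
        // Scenario with delimiter "++": the third '+' begins the next record.
        System.out.println(completeRecords("abc+++def++ghi++", "++"));
    }
}
```

Comparing the records produced by a test run at each split size against this list is one
way to spot the duplicates: every split size should yield the same records, each exactly once.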



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
