Date: Tue, 23 Jun 2015 14:19:04 +0000 (UTC)
From: "Hudson (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-5948) org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597707#comment-14597707 ]

Hudson commented on MAPREDUCE-5948:
-----------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2165 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2165/])
MAPREDUCE-5948. org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well.
Contributed by Vinayakumar B, Rushabh Shah, and Akira AJISAKA (jlowe: rev 077250d8d7b4b757543a39a6ce8bb6e3be356c6f)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestLineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LineRecordReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/UncompressedSplitLineReader.java
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapred/TestLineRecordReader.java
* hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java
* hadoop-mapreduce-project/CHANGES.txt


> org.apache.hadoop.mapred.LineRecordReader does not handle multibyte record delimiters well
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5948
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5948
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.23.9, 2.2.0
>         Environment: CDH3U2 Redhat linux 5.7
>            Reporter: Kris Geusebroek
>            Assignee: Akira AJISAKA
>            Priority: Critical
>             Fix For: 2.8.0
>
>         Attachments: HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, HADOOP-9867.patch, MAPREDUCE-5948.002.patch, MAPREDUCE-5948.003.patch
>
>
> Having defined a record delimiter of multiple bytes in a new InputFileFormat sometimes has the effect of skipping records from the input.
> This happens when the input splits are split off just after a record separator. The starting point for the next split would be nonzero and skipFirstLine would be true.
> A seek into the file is done to start - 1, and the text up to the first record delimiter is ignored (on the presumption that this record is already handled by the previous map task). Since the record delimiter is multibyte, the seek brings only the last byte of the delimiter into scope, and it is not recognized as a full delimiter. So the text is skipped until the next delimiter (ignoring a full record!).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
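The record-skipping scenario in the issue description can be reproduced outside Hadoop with a small model. The Java below is a sketch, not the LineRecordReader code: it assumes a split reader that, for a nonzero start, backs up one byte and discards everything through the first full delimiter, and then reads records while its position is before the split end. The class name MultibyteDelimiterSketch, the helpers indexOf and readSplit, and the sample data are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MultibyteDelimiterSketch {

    // Index of the first occurrence of delim in data at or after 'from', or -1 if none.
    static int indexOf(byte[] data, byte[] delim, int from) {
        for (int i = Math.max(from, 0); i + delim.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < delim.length; j++) {
                if (data[i + j] != delim[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // Simplified model (an assumption, not Hadoop's code) of reading split [start, end).
    static List<String> readSplit(byte[] data, byte[] delim, int start, int end) {
        int pos = start;
        if (start != 0) {
            // "Skip first line": search from start - 1 and drop everything
            // through the first complete delimiter found from there.
            int hit = indexOf(data, delim, start - 1);
            pos = (hit < 0) ? data.length : hit + delim.length;
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length) {
            int hit = indexOf(data, delim, pos);
            int recordEnd = (hit < 0) ? data.length : hit;
            records.add(new String(data, pos, recordEnd - pos, StandardCharsets.UTF_8));
            pos = (hit < 0) ? data.length : hit + delim.length;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaa||bbb||ccc||".getBytes(StandardCharsets.UTF_8);
        byte[] delim = "||".getBytes(StandardCharsets.UTF_8);
        // The split boundary at offset 5 falls just after the first "||".
        System.out.println(readSplit(data, delim, 0, 5));   // prints [aaa]
        System.out.println(readSplit(data, delim, 5, 15));  // prints [ccc]; "bbb" is lost
    }
}
```

With a one-byte delimiter such as "\n", the byte at start - 1 is the whole delimiter, so the same model skips only an empty line and the second split returns every remaining record. With the two-byte "||", only the trailing "|" is in scope at start - 1, so the search runs on to the next delimiter and the record "bbb" is read by neither split.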