Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Tue, 5 Jan 2016 14:06:39 +0000 (UTC)
From: "Jason Lowe (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.12927252.1451993119000.21505.1452002799812@Atlassian.JIRA>
In-Reply-To: <JIRA.12927252.1451993119000@Atlassian.JIRA>
References: <JIRA.12927252.1451993119000@Atlassian.JIRA>
 <JIRA.12927252.1451993119239@arcas>
Subject: [jira] [Commented] (MAPREDUCE-6598) LineReader enhencement to
 support text records contains "\n"
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/MAPREDUCE-6598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083086#comment-15083086 ] 

Jason Lowe commented on MAPREDUCE-6598:
---------------------------------------

LineReader already supports a custom record delimiter: several of its constructors take a byte array specifying the record delimiter bytes. LineRecordReader, which uses LineReader internally, supports this as well.
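
To illustrate the idea of delimiter-based records, here is a minimal sketch in plain Java. It is not Hadoop's LineReader code; the class name DelimitedRecordSketch and the method readRecords are hypothetical, and the whole input is buffered in memory for simplicity. It only shows how a byte-array delimiter lets a record span several lines. In a Hadoop job the same effect is usually configured rather than coded, for example through the textinputformat.record.delimiter property read by TextInputFormat.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hadoop.
class DelimitedRecordSketch {

    // Splits the stream into records separated by the given delimiter bytes.
    // The delimiter itself is dropped from each returned record.
    static List<String> readRecords(InputStream in, byte[] delimiter) throws IOException {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }

        // Assumption for brevity: the input fits in memory.
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            all.write(chunk, 0, n);
        }
        byte[] data = all.toByteArray();

        List<String> records = new ArrayList<>();
        int start = 0; // first byte of the record being collected
        int i = 0;     // current scan position
        while (i + delimiter.length <= data.length) {
            if (matchesAt(data, i, delimiter)) {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                i += delimiter.length;
                start = i;
            } else {
                i++;
            }
        }
        // Any bytes after the last delimiter form a final record.
        if (start < data.length) {
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    // Returns true if data contains the delimiter starting at offset pos.
    private static boolean matchesAt(byte[] data, int pos, byte[] delimiter) {
        for (int k = 0; k < delimiter.length; k++) {
            if (data[pos + k] != delimiter[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        String input = "<msg>\n<id>1</id>\n</msg>\n<msg>\n<id>2</id>\n</msg>\n";
        byte[] delimiter = "</msg>\n".getBytes(StandardCharsets.UTF_8);
        List<String> records = readRecords(
                new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)), delimiter);
        for (String record : records) {
            System.out.println("record: " + record.replace("\n", "\\n"));
        }
    }
}
```

Because the delimiter is removed from each record, a closing tag used as the delimiter would have to be re-appended by the caller before the record is parsed as XML.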


> LineReader enhencement to support text records contains "\n"
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-6598
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6598
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>    Affects Versions: 2.6.0
>         Environment: RHEL 7, Spark 1.3.1, Hadoop 2.6.0
>            Reporter: cloudyarea
>            Priority: Minor
>
> We have billions of XML message records stored in text files that need to be parsed in parallel by Spark. By default, Spark opens a Hadoop text file using LineReader, which provides a single line of text as a record.
> The XML messages contain "\n", and I believe this is a common scenario: many users have records that span multiple lines. Currently, the solution is to extend the RecordReader interface.
> To reduce the repeated work, I wrote a class named MessageRecordReader that extends the RecordReader interface. The user can set a string as the record delimiter, and MessageRecordReader then provides a multi-line record to the user.
> I would like to contribute the code to the community. Please let me know if you are interested in this simple but useful implementation.
> Thank you very much and happy new year!

