Subject: Re: XML to TEXT
From: Ranjini Rathinam
To: user@hadoop.apache.org
Date: Fri, 3 Jan 2014 14:51:50 +0530
Hi,

I used XMLInputFormat with the RecordReader class, the same as you suggested.

The whole XML gets split into parts. For example, consider the XML below:

<Comp><Emp><id></id><name></name></Emp><Emp><id></id><name></name></Emp></Comp>

After using the RecordReader class, the XML output is:

<Emp><id></id><name></name></Emp><Emp><id></id><name></name></Emp>

The start and end tags are both Emp.

The XML is not converted into text.

Please suggest how to fix this.

Thanks in advance,

Ranjini

On Fri, Jan 3, 2014 at 11:22 AM, Azuryy Yu <azuryyyu@gmail.com> wrote:
Hi,

You can use org.apache.hadoop.streaming.StreamInputFormat in a MapReduce job to convert XML to text.

Suppose your XML looks like this:
<xml>
  <name>lll</name>
</xml>

You need to specify stream.recordreader.begin and stream.recordreader.end in the Configuration:
Configuration conf = new Configuration();
conf.set("stream.recordreader.begin", "<xml>");
conf.set("stream.recordreader.end", "</xml>");
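
To illustrate what a begin marker and an end marker do, here is a minimal plain-Java sketch. It is not the Hadoop implementation: the class name MarkerSplitter and the method split are hypothetical, and the sketch only assumes that a record is the text from one begin marker through the next end marker.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop.
class MarkerSplitter {

    // Returns every substring that starts at a begin marker and
    // ends with the next end marker that follows it.
    public static List<String> split(String input, String begin, String end) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(begin, from);
            if (start < 0) {
                break; // no further begin marker
            }
            int stop = input.indexOf(end, start + begin.length());
            if (stop < 0) {
                break; // unterminated record, ignore it
            }
            stop += end.length();
            records.add(input.substring(start, stop));
            from = stop;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<Comp><Emp><id>100</id><name>RR</name></Emp>"
                   + "<Emp><id></id><name></name></Emp></Comp>";
        // Each printed line is one <Emp>...</Emp> record.
        for (String record : split(xml, "<Emp>", "</Emp>")) {
            System.out.println(record);
        }
    }
}
```

With markers of this kind, each record is still XML, so turning it into text remains a separate step for the mapper.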





On Fri, Jan 3, 2014 at 1:16 PM, Ranjini Rathinam <ranjinibecse@gmail.com> wrote:
Hi,

I need to convert XML into text using MapReduce.

I have used DOM and SAX parsers.

After using SAX Builder in the mapper class, the child node acts as the root element.

In the System.out output, I found that the root element is taken from the child element and printed.

For example:

<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
When this XML is passed to the mapper, System.out prints the root element.

I am getting the root element as:

<id>
<name>

Please suggest how to fix this.

I need to convert the XML into text using MapReduce code. Please provide an example.

Required output is:

id,name
100,RR

Please help.

Thanks in advance,
Ranjini R
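
One possible way to produce the required output from a single Emp record is sketched below in plain Java, using the DOM parser that ships with the JDK. The class name EmpToCsv and the method toCsv are hypothetical. In a Hadoop mapper the record string would presumably come from the input value, and that wiring is left out so the sketch runs on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class EmpToCsv {

    // Parses one <Emp>...</Emp> record and returns "id,name".
    public static String toCsv(String empRecord) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(empRecord)));
            // For a single record, <Emp> is the document (root) element.
            Element emp = doc.getDocumentElement();
            String id = emp.getElementsByTagName("id").item(0).getTextContent();
            String name = emp.getElementsByTagName("name").item(0).getTextContent();
            return id + "," + name;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse record: " + empRecord, e);
        }
    }

    public static void main(String[] args) {
        System.out.println("id,name"); // header line
        System.out.println(toCsv("<Emp><id>100</id><name>RR</name></Emp>"));
    }
}
```

Running main prints the header line id,name followed by 100,RR.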