Return-Path: X-Original-To: apmail-hadoop-common-user-archive@www.apache.org Delivered-To: apmail-hadoop-common-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id C5A3910D29 for ; Wed, 12 Feb 2014 12:33:32 +0000 (UTC) Received: (qmail 81651 invoked by uid 500); 12 Feb 2014 12:33:20 -0000 Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org Received: (qmail 81479 invoked by uid 500); 12 Feb 2014 12:33:19 -0000 Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hadoop.apache.org Delivered-To: mailing list user@hadoop.apache.org Received: (qmail 81465 invoked by uid 99); 12 Feb 2014 12:33:17 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Feb 2014 12:33:17 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of shankar.hiremath@huawei.com designates 119.145.14.64 as permitted sender) Received: from [119.145.14.64] (HELO szxga01-in.huawei.com) (119.145.14.64) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 Feb 2014 12:33:05 +0000 Received: from 172.24.2.119 (EHLO szxeml207-edg.china.huawei.com) ([172.24.2.119]) by szxrg01-dlp.huawei.com (MOS 4.3.7-GA FastPath queued) with ESMTP id BRG51668; Wed, 12 Feb 2014 20:32:39 +0800 (CST) Received: from SZXEML403-HUB.china.huawei.com (10.82.67.35) by szxeml207-edg.china.huawei.com (172.24.2.56) with Microsoft SMTP Server (TLS) id 14.3.158.1; Wed, 12 Feb 2014 20:32:37 +0800 Received: from nkgeml405-hub.china.huawei.com (10.98.56.36) by szxeml403-hub.china.huawei.com (10.82.67.35) with Microsoft SMTP Server (TLS) id 14.3.158.1; Wed, 12 Feb 2014 20:32:38 +0800 Received: from nkgeml510-mbs.china.huawei.com ([169.254.4.121]) by 
nkgeml405-hub.china.huawei.com ([10.98.56.36]) with mapi id 14.03.0158.001; Wed, 12 Feb 2014 20:32:35 +0800 From: Shankar hiremath To: "user@hadoop.apache.org" Subject: RE: XML to TEXT Thread-Topic: XML to TEXT Thread-Index: AQHPJ8q5Q1+x3RLOwEGZIj/bANvTtJqxi1KA Date: Wed, 12 Feb 2014 12:32:34 +0000 Message-ID: References: In-Reply-To: Accept-Language: en-IN, zh-CN, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.18.147.75] Content-Type: multipart/alternative; boundary="_000_CB0197544979BE458C310023C8F4505B09164230nkgeml510mbschi_" MIME-Version: 1.0 X-CFilter-Loop: Reflected X-Virus-Checked: Checked by ClamAV on apache.org --_000_CB0197544979BE458C310023C8F4505B09164230nkgeml510mbschi_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable

As per my understanding, reading one XML line at a time and sending it to a map task will not work in general.

I suggest partitioning the XML data so that each record is one complete "student" element, well-formed as per the XML specification.

Then pass each partitioned "student" XML element as input to a mapper; the mapper parses this XML and generates the text data (here you can reuse your existing recursive code) on a single line.

 

Ex:

<student>
    <id>100</id>
    <name>ranjini-1</name>
    ………………
</student>

The above student element should be sent to mapper-1

<student>
    <id>101</id>
    <name>ranjini-2</name>
    ………………
</student>

The above student element should be sent to mapper-2
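
The partitioning described above can be tried outside Hadoop first. What follows is only a minimal sketch in plain Java, not the Hadoop API: the class name XmlRecordSplitter and its split method are hypothetical, and it assumes that "student" elements carry no attributes and are never nested inside one another. In a real job the same scanning logic would have to live in a custom input format or record reader, so that each map call receives one complete element.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts an XML string into complete <tag>...</tag> records. */
class XmlRecordSplitter {

    /**
     * Returns every complete element named tag, in document order.
     * Assumption: the element has no attributes and is not nested in itself.
     */
    static List<String> split(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> records = new ArrayList<String>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(open, from);
            if (start < 0) {
                break;                      // no further record
            }
            int end = xml.indexOf(close, start);
            if (end < 0) {
                break;                      // incomplete trailing record: drop it
            }
            end += close.length();
            records.add(xml.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String school = "<school>\n"
                + "<student>\n<id>100</id>\n<name>ranjini-1</name>\n</student>\n"
                + "<student>\n<id>101</id>\n<name>ranjini-2</name>\n</student>\n"
                + "</school>";
        for (String record : split(school, "student")) {
            // Collapse line breaks so each record fits on one line of a text input file.
            System.out.println(record.replaceAll("\\s*\\n\\s*", ""));
        }
    }
}
```

Each string returned by split is a well-formed document on its own, so it can be handed to a DOM parser and flattened independently of the other records.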

 

 

 

Complete XML:

<school>
    <student>
        <id>100</id>
        <name>ranjini-1</name>
        ………………
    </student>
    <student>
        <id>101</id>
        <name>ranjini-2</name>
        ………………
    </student>
    ……..
</school>

From: Ranjini Rathinam [mailto:ranjinibecse@gmail.com]
Sent: 12 February 2014 PM 01:46
To: user@hadoop.apache.org
Subject: Fwd: XML to TEXT

 

 

 

Please help me convert this XML to text.

I have attached the XML; please find the attachment.

Some students have two address tags, some have one address tag, and some have no address tag at all.

I need to convert the XML into a string.

This is my desired output:

100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra

101,nivetha,HOME,a street,ad street,ads street,chennai,tn
102,siva
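
One way to get lines of this shape is a depth-first walk that collects the text of every leaf element, so a student with zero, one or two address tags needs no special handling. The sketch below is plain Java and rests on assumptions: the attached XML is not shown in this thread, so the element names in the sample (address, type, street, city, state) are hypothetical, as is the class name StudentFlattener.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: flattens one XML record into a comma-separated line. */
class StudentFlattener {

    static String flatten(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed XML record", e);
        }
        List<String> fields = new ArrayList<String>();
        collect(doc.getDocumentElement(), fields);
        StringBuilder line = new StringBuilder();
        for (String field : fields) {
            if (line.length() > 0) {
                line.append(',');
            }
            line.append(field);
        }
        return line.toString();
    }

    /** Recurses into element children; a node without element children is a leaf. */
    private static void collect(Node node, List<String> out) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collect(child, out);
            }
        }
        if (!hasElementChild) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(text);              // empty leaves are skipped
            }
        }
    }

    public static void main(String[] args) {
        // Sample layout is an assumption, not the attached file.
        System.out.println(flatten("<student><id>101</id><name>nivetha</name>"
                + "<address><type>HOME</type><street>a street</street>"
                + "<city>chennai</city><state>tn</state></address></student>"));
        System.out.println(flatten("<student><id>102</id><name>siva</name></student>"));
    }
}
```

Because only element nodes are followed, the whitespace text nodes created by indentation and line breaks never reach the output.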



In plain Java I have written this using recursion, but how should I write it in MapReduce? Please help.

Thanks in advance.

Regards,

Ranjini R

 

 

On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <ranjinibecse@gmail.com> wrote:

Hi,

It's working fine. The problem was in the XML: the spaces I had put in.

Thanks a lot.

Regards,

Ranjini.R

On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <diego.gutierrez@ucsp.edu.pe> wrote:

Hi,

I'm sending you the Eclipse project with the code. Hope this helps.

Regards
Diego Gutiérrez

 

 

2014/1/9 Ranjini Rathinam <ranjinibecse@gmail.com>

Hi,

 

I am using Java 1.6, Hadoop 0.20 and Ubuntu 12.04.

If possible, please send the jar and code for review.

 

Thanks for the support,

 

Ranjini

On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <diego.gutierrez@ucsp.edu.pe> wrote:

Hi,

I've noticed that your XML file has line breaks. By default Hadoop splits every input file into lines and passes them to the map function one at a time; in other words, each call to the map function processes one line of the file. Please remove the line breaks from your XML and try again. I've tested here with your XML file (changing only DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET)) and this is the output in result.txt:


id,name
100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
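
The effect of line-by-line input can be reproduced without Hadoop. The sketch below is an illustration only (the class name LineSplitDemo and the method rootNameOrNull are hypothetical): it feeds a DOM parser one line at a time, the way a mapper using the default text input would see the file. A line such as <id>100</id> is a complete document whose root is id, while a line holding only <Company> is not well-formed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: what a DOM parser makes of a single line of an XML file. */
class LineSplitDemo {

    /** Returns the root element name of the fragment, or null if it is not well-formed. */
    static String rootNameOrNull(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            byte[] bytes = fragment.getBytes(StandardCharsets.UTF_8);
            return builder.parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement().getNodeName();
        } catch (SAXException e) {
            return null;                    // e.g. an opening tag with no closing tag
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String file = "<Company>\n<Employee>\n<id>100</id>\n<ename>ranjini</ename>\n"
                + "</Employee>\n</Company>";
        // One parse per line, as if each line were a separate map input value.
        for (String line : file.split("\n")) {
            System.out.println(line + " -> root: " + rootNameOrNull(line));
        }
        // The same content with the line breaks removed parses as one document.
        System.out.println("joined -> root: " + rootNameOrNull(file.replace("\n", "")));
    }
}
```

With the file split on line breaks, no single value ever contains a Company element with Employee children, so an XPath such as /Company/Employee has nothing to match.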

Note: I don't know whether the Java version or the Hadoop version could be the problem here. I'm using Ubuntu 12.04, Oracle Java 7 and Hadoop 2.2.0.

If you want, I can send you the jar file with the code :)

Regards

Diego Gutiérrez.

 

 

2014/1/7 Ranjini Rathinam <ranjinibecse@gmail.com>

Hi Gutierrez ,

 

As suggested, I tried the code, but result.txt contained only the header. Nothing else was printed.

After debugging, I found that parsing produces no value.

The problem is in the line given below, which is in bold. When I added a System.out call, no value was printed at this line.

String xmlContent = value.toString();

InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
try {
    builder = factory.newDocumentBuilder();
    Document doc = builder.parse(is);

    String ed = doc.getDocumentElement().getNodeName();
    out.write(ed.getBytes());
    DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);

When I print:

out.write(xmlContent.getBytes): the whole XML is printed.

Then I added a System.out call for list; nothing was printed.

out.write(ed.getBytes): nothing is printed.

Please suggest where I am going wrong and help me fix this.

Thanks in advance.

I have attached my code. Please review.

 

 

Mapper class:-

 

public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/user/task/Sales/result.txt";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        InputStream resultIS = new ByteArrayInputStream(new byte[0]);

        String header = "id,name\n";
        out.write(header.getBytes());

        String xmlContent = value.toString();

        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);

            String ed = doc.getDocumentElement().getNodeName();
            out.write(ed.getBytes());
            DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc, XPathConstants.NODESET);

            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());
            }

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            e.printStackTrace();
        }

        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}

 

 

 

Main class

public class MainXml {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        String output = "/user/task/Sales/";
        Job job = new Job(conf, "XML to Text");
        job.setJarByClass(MainXml.class);
        // job.setJobName("XML to Text");

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
        Path outPath = new Path(output);
        FileOutputFormat.setOutputPath(job, outPath);
        FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
        if (dfs.exists(outPath)) {
            dfs.delete(outPath, true);
        }
        job.setMapperClass(XmlTextMapper.class);

        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

 

 

My xml file

 

<Company>
<Employee>
<id>100</id>
<ename>ranjini</ename>
<dept>IT1</dept>
<sal>123456</sal>
<location>nextlevel1</location>
<Address>
<Home>Chennai1</Home>
<Office>Navallur1</Office>
</Address>
</Employee>
<Employee>
<id>1001</id>
<ename>ranjinikumar</ename>
<dept>IT</dept>
<sal>1234516</sal>
<location>nextlevel</location>
<Address>
<Home>Chennai</Home>
<Office>Navallur</Office>
</Address>
</Employee>
</Company>

 

 

Thanks in advance.

 

Ranjini


 

On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ranjinibecse@gmail.com> wrote:

Hi,

 

Thanks a lot .

 

Ranjini

On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <diego.gutierrez@ucsp.edu.pe> wrote:

Hi,

I suggest using XPath, which Java supports natively (javax.xml.xpath) for querying parsed XML.

For the main problem, as with the distcp command ( http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ), there is no need for a reduce function, because you can parse the XML input and create the file you need in the map function. For example, the following code reads an XML file in HDFS, parses it and creates a new file ("/result.txt") with the expected format:

id,name

100,RR

Mapper function:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;

public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final XPathFactory xpathFactory = XPathFactory.newInstance();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        String resultFileName = "/result.txt";

        Configuration conf = new Configuration();

        FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
        FSDataOutputStream out = fs.create(new Path(resultFileName));

        InputStream resultIS = new ByteArrayInputStream(new byte[0]);

        String header = "id,name\n";
        out.write(header.getBytes());

        String xmlContent = value.toString();
        InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder;
        try {
            builder = factory.newDocumentBuilder();
            Document doc = builder.parse(is);
            DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
                    XPathConstants.NODESET);

            int size = list.getLength();
            for (int i = 0; i < size; i++) {
                Node node = list.item(i);
                String line = "";
                NodeList nodeList = node.getChildNodes();
                int childNumber = nodeList.getLength();
                for (int j = 0; j < childNumber; j++) {
                    line += nodeList.item(j).getTextContent() + ",";
                }
                if (line.endsWith(","))
                    line = line.substring(0, line.length() - 1);
                line += "\n";
                out.write(line.getBytes());

            }

        } catch (ParserConfigurationException e) {
            MyLogguer.log("error: " + e.getMessage());
            e.printStackTrace();
        } catch (SAXException e) {
            MyLogguer.log("error: " + e.getMessage());
            e.printStackTrace();
        } catch (XPathExpressionException e) {
            MyLogguer.log("error: " + e.getMessage());
            e.printStackTrace();
        }

        IOUtils.copyBytes(resultIS, out, 4096, true);
        out.close();
    }

    public static Object getNode(String xpathStr, Node node, QName retunType)
            throws XPathExpressionException {
        XPath xpath = xpathFactory.newXPath();
        return xpath.evaluate(xpathStr, node, retunType);
    }
}



--------------------------------------

Main class:


import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {

    public static void main(String[] args) throws Exception {

        if (args.length != 2) {
            System.err
                    .println("Usage: XMLtoText <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("XML to Text");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(XmlToTextMapper.class);
        // Map-only job: the mapper writes the result itself, so no reducer is needed.
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}

 

To execute the job you can use:

         bin/hadoop Main /data.xml /output

Then you can use this to see the result.txt file:

          hadoop fs -cat /result.txt

I'm using this XML as input:

<main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>

and the content of result.txt is like this:

id,name
1,NameA
2,NameB
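
The conversion step in the mapper above can be exercised on its own, outside Hadoop. The following is a minimal sketch rather than the code from the Eclipse project: the class name XPathToCsv and the method toCsvLines are hypothetical. It casts the XPath result to org.w3c.dom.NodeList, the type that XPathConstants.NODESET is documented to map to, instead of the JDK-internal DTMNodeList, and it skips non-element children so that whitespace between tags does not produce empty fields.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: turns each node matched by an XPath into one CSV line. */
class XPathToCsv {

    static List<String> toCsvLines(String xml, String recordPath) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList records;
        try {
            records = (NodeList) xpath.evaluate(recordPath,
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("cannot evaluate " + recordPath, e);
        }
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < records.getLength(); i++) {
            StringBuilder line = new StringBuilder();
            NodeList children = records.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                if (child.getNodeType() != Node.ELEMENT_NODE) {
                    continue;               // skip whitespace-only text nodes
                }
                if (line.length() > 0) {
                    line.append(',');
                }
                line.append(child.getTextContent().trim());
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String xml = "<main><data><id>1</id><name>NameA</name></data>"
                + "<data><id>2</id><name>NameB</name></data></main>";
        System.out.println("id,name");
        for (String line : toCsvLines(xml, "/main/data")) {
            System.out.println(line);
        }
    }
}
```

Inside a mapper, each returned line could be emitted through the job's normal output instead of a hand-opened HDFS file; that wiring is left out here because it depends on the Hadoop version in use.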

Hope this helps.

 

2014/1/3 Ranjini Rathinam <ranjinibecse@gmail.com>

Hi,

 

I need to convert XML into text using MapReduce.

I have used the DOM and SAX parsers.

After using SAX Builder in the mapper class, the child node acts as the root element.

Looking at the System.out output, I found that the root element is being taken from the child element and printed.

For example,

<Comp><Emp><id>100</id><name>RR</name></Emp></Comp>

when this XML is passed to the mapper and the root element is printed to System.out,

I am getting the root element as

<id>

<name>

Please suggest how to fix this.

I need to convert the XML into text using MapReduce code. Please provide an example.

 

Required output is

 

id,name

100,RR

 

Please help.

 

Thanks in advance,

Ranjini R

--_000_CB0197544979BE458C310023C8F4505B09164230nkgeml510mbschi_--