hadoop-mapreduce-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ranjini Rathinam <ranjinibe...@gmail.com>
Subject Re: XML to TEXT
Date Wed, 08 Jan 2014 06:34:00 GMT
Hi,

I am using hive. As suggest i am using xpath in select clause, but the
error is coming as invalid expression.

Please give some sample xml to process xml in hive.

Thanks in advance

Ranjini

On Tue, Jan 7, 2014 at 5:14 PM, Ranjini Rathinam <ranjinibecse@gmail.com>wrote:

> Hi Gutierrez ,
>
> As suggest i tried with the code , but in the result.txt i got output only
> header. Nothing else was printing.
>
> After debugging i came to know that while parsing , there is no value.
>
> The problem is in line given below which is bold. While putting SysOut i
> found no value printing in this line.
>
>  String xmlContent = value.toString();
>
>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>         DocumentBuilderFactory factory =
> DocumentBuilderFactory.newInstance();
>         DocumentBuilder builder;
>         try {
>             builder = factory.newDocumentBuilder();
>
> * Document doc = builder.parse(is);*
>    String ed=doc.getDocumentElement().getNodeName();
>    out.write(ed.getBytes());
>             DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
> doc,XPathConstants.NODESET);
>
> When iam printing
>
> out.write(xmlContent.getBytes):- the whole xml is being printed.
>
> then i wrote for Sysout for list ,nothing printed.
>  out.write(ed.getBytes):- nothing is being printed.
>
> Please suggest where i am going wrong. Please help to fix this.
>
> Thanks in advance.
>
> I have attached my code.Please review.
>
>
> Mapper class:-
>
> public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
>      private static final XPathFactory xpathFactory =
> XPathFactory.newInstance();
>     @Override
>     public void map(LongWritable key, Text value, Context context)
>             throws IOException, InterruptedException {
>         String resultFileName = "/user/task/Sales/result.txt";
>
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>         String header = "id,name\n";
>         out.write(header.getBytes());
>          String xmlContent = value.toString();
>
>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>         DocumentBuilderFactory factory =
> DocumentBuilderFactory.newInstance();
>         DocumentBuilder builder;
>         try {
>             builder = factory.newDocumentBuilder();
>             Document doc = builder.parse(is);
>    String ed=doc.getDocumentElement().getNodeName();
>    out.write(ed.getBytes());
>             DTMNodeList list = (DTMNodeList) getNode("/Company/Employee",
> doc,XPathConstants.NODESET);
>              int size = list.getLength();
>             for (int i = 0; i < size; i++) {
>                 Node node = list.item(i);
>                 String line = "";
>                 NodeList nodeList = node.getChildNodes();
>                 int childNumber = nodeList.getLength();
>                 for (int j = 0; j < childNumber; j++)
>     {
>                     line += nodeList.item(j).getTextContent() + ",";
>                 }
>                 if (line.endsWith(","))
>                     line = line.substring(0, line.length() - 1);
>                 line += "\n";
>                 out.write(line.getBytes());
>             }
>         } catch (ParserConfigurationException e) {
>              e.printStackTrace();
>         } catch (SAXException e) {
>              e.printStackTrace();
>         } catch (XPathExpressionException e) {
>              e.printStackTrace();
>         }
>          IOUtils.copyBytes(resultIS, out, 4096, true);
>         out.close();
>     }
>     public static Object getNode(String xpathStr, Node node, QName
> retunType)
>             throws XPathExpressionException {
>         XPath xpath = xpathFactory.newXPath();
>         return xpath.evaluate(xpathStr, node, retunType);
>     }
> }
>
>
>
> Main class
> public class MainXml {
>      public static void main(String[] args) throws Exception {
> Configuration conf = new Configuration();
>         if (args.length != 2) {
>             System.err
>                     .println("Usage: XMLtoText <input path> <output
> path>");
>             System.exit(-1);
>         }
>   String output="/user/task/Sales/";
>        Job job = new Job(conf, "XML to Text");
>         job.setJarByClass(MainXml.class);
>        // job.setJobName("XML to Text");
>
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>        // FileOutputFormat.setOutputPath(job, new Path(args[1]));
>   Path outPath = new Path(output);
>   FileOutputFormat.setOutputPath(job, outPath);
>   FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>   if (dfs.exists(outPath)) {
>   dfs.delete(outPath, true);
>   }
>         job.setMapperClass(XmlTextMapper.class);
>
>         job.setNumReduceTasks(0);
>         job.setMapOutputKeyClass(Text.class);
>         job.setMapOutputValueClass(Text.class);
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }
>
>
> My xml file
>
> <Company>
> <Employee>
> <id>100</id>
> <ename>ranjini</ename>
> <dept>IT1</dept>
> <sal>123456</sal>
> <location>nextlevel1</location>
> <Address>
> <Home>Chennai1</Home>
> <Office>Navallur1</Office>
> </Address>
> </Employee>
> <Employee>
> <id>1001</id>
> <ename>ranjinikumar</ename>
> <dept>IT</dept>
> <sal>1234516</sal>
> <location>nextlevel</location>
> <Address>
> <Home>Chennai</Home>
> <Office>Navallur</Office>
> </Address>
> </Employee>
> </Company>
>
>
> Thanks in advance.
>
> Ranjini
>
>
>
>>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <ranjinibecse@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> Thanks a lot .
>>>
>>> Ranjini
>>>
>>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I suggest to use the XPath, this is a native java support for parse xml
>>>> and json formats.
>>>>
>>>> For the main problem, like distcp command(
>>>> http://hadoop.apache.org/docs/r0.19.0/distcp.pdf ) there is no need of
>>>> a reduce function, because you can parse the xml input file and create the
>>>> file you need in the map function.For example the following code reads an
>>>> xml file in HDFS, parse it and create a new file ( "/result.txt" ) with the
>>>> expected format:
>>>> id,name
>>>> 100,RR
>>>>
>>>>
>>>> Mapper function:
>>>>
>>>> import java.io.ByteArrayInputStream;
>>>> import java.io.IOException;
>>>> import java.io.InputStream;
>>>> import java.net.URI;
>>>>
>>>> import javax.xml.namespace.QName;
>>>> import javax.xml.parsers.DocumentBuilder;
>>>> import javax.xml.parsers.DocumentBuilderFactory;
>>>> import javax.xml.parsers.ParserConfigurationException;
>>>> import javax.xml.xpath.XPath;
>>>> import javax.xml.xpath.XPathConstants;
>>>> import javax.xml.xpath.XPathExpressionException;
>>>> import javax.xml.xpath.XPathFactory;
>>>>
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.IOUtils;
>>>> import org.apache.hadoop.io.LongWritable;
>>>> import org.apache.hadoop.io.Text;
>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>> import org.w3c.dom.Document;
>>>> import org.w3c.dom.Node;
>>>> import org.w3c.dom.NodeList;
>>>> import org.xml.sax.SAXException;
>>>>
>>>> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>>>>
>>>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text,
>>>> Text> {
>>>>
>>>>     private static final XPathFactory xpathFactory =
>>>> XPathFactory.newInstance();
>>>>
>>>>     @Override
>>>>     public void map(LongWritable key, Text value, Context context)
>>>>             throws IOException, InterruptedException {
>>>>
>>>>         String resultFileName = "/result.txt";
>>>>
>>>>
>>>>         Configuration conf = new Configuration();
>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName),
>>>> conf);
>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>
>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>
>>>>         String header = "id,name\n";
>>>>         out.write(header.getBytes());
>>>>
>>>>         String xmlContent = value.toString();
>>>>         InputStream is = new
>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>         DocumentBuilderFactory factory =
>>>> DocumentBuilderFactory.newInstance();
>>>>         DocumentBuilder builder;
>>>>         try {
>>>>             builder = factory.newDocumentBuilder();
>>>>             Document doc = builder.parse(is);
>>>>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>>>>                     XPathConstants.NODESET);
>>>>
>>>>             int size = list.getLength();
>>>>             for (int i = 0; i < size; i++) {
>>>>                 Node node = list.item(i);
>>>>                 String line = "";
>>>>                 NodeList nodeList = node.getChildNodes();
>>>>                 int childNumber = nodeList.getLength();
>>>>                 for (int j = 0; j < childNumber; j++) {
>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>                 }
>>>>                 if (line.endsWith(","))
>>>>                     line = line.substring(0, line.length() - 1);
>>>>                 line += "\n";
>>>>                 out.write(line.getBytes());
>>>>
>>>>             }
>>>>
>>>>         } catch (ParserConfigurationException e) {
>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         } catch (SAXException e) {
>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         } catch (XPathExpressionException e) {
>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>             e.printStackTrace();
>>>>         }
>>>>
>>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>         out.close();
>>>>     }
>>>>
>>>>     public static Object getNode(String xpathStr, Node node, QName
>>>> retunType)
>>>>             throws XPathExpressionException {
>>>>         XPath xpath = xpathFactory.newXPath();
>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>     }
>>>> }
>>>>
>>>>
>>>>
>>>> --------------------------------------
>>>> Main class:
>>>>
>>>>
>>>> public class Main {
>>>>
>>>>     public static void main(String[] args) throws Exception {
>>>>
>>>>         if (args.length != 2) {
>>>>             System.err
>>>>                     .println("Usage: XMLtoText <input path> <output
>>>> path>");
>>>>             System.exit(-1);
>>>>         }
>>>>
>>>>         Job job = new Job();
>>>>         job.setJarByClass(Main.class);
>>>>         job.setJobName("XML to Text");
>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>
>>>>         job.setMapperClass(XmlToTextMapper.class);
>>>>         job.setNumReduceTasks(0);
>>>>         job.setMapOutputKeyClass(Text.class);
>>>>         job.setMapOutputValueClass(Text.class);
>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>
>>>>     }
>>>> }
>>>>
>>>> To execute the job you can use :
>>>>
>>>>          bin/hadoop Main /data.xml /output.
>>>>
>>>>
>>>> Then you can use this to see result.txt file:
>>>>
>>>>           hadoop fs -cat /result.txt
>>>>
>>>>
>>>> I'm using this xml as input:
>>>>
>>>>
>>>> <Comp><Emp><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></Emp></Comp>
>>>>
>>>> and the content in result.txt is like this:
>>>>
>>>> id,name
>>>> 1,NameA
>>>> 2,NameB
>>>>
>>>>
>>>> Hope this helps.
>>>>
>>>>
>>>> 2014/1/3 Ranjini Rathinam <ranjinibecse@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> Need to convert XML into text using mapreduce.
>>>>>
>>>>> I have used DOM and SAX parser.
>>>>>
>>>>> After using SAX Builder in mapper class. the child node act as root
>>>>> Element.
>>>>>
>>>>> While seeing in Sys out i found thar root element is taking the child
>>>>> element and printing.
>>>>>
>>>>> For Eg,
>>>>>
>>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>>> when this xml is passed in mapper , in sys out printing the root
>>>>> element
>>>>>
>>>>> I am getting the the root element as
>>>>>
>>>>> <id>
>>>>> <name>
>>>>>
>>>>> Please suggest and help to fix this.
>>>>>
>>>>> I need to convert the xml into text using mapreduce code. Please
>>>>> provide with example.
>>>>>
>>>>> Required output is
>>>>>
>>>>> id,name
>>>>> 100,RR
>>>>>
>>>>> Please help.
>>>>>
>>>>> Thanks in advance,
>>>>> Ranjini R
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message