hadoop-mapreduce-user mailing list archives

From Ranjini Rathinam <ranjinibe...@gmail.com>
Subject Fwd: XML to TEXT
Date Wed, 12 Feb 2014 08:15:41 GMT
>
> Please help me convert this XML to text.
>>
>>
>> I have attached the XML; please see the attachment.
>>
>> Some students have two address tags, some have one, and some have no
>> address tag at all.
>>
>> I need to convert the XML into a string.
>>
>> This is my desired output:
>>
>> 100,ranjini,HOME,a street,ad street,ads street,chennai,tn,OFFICE,adsja1 street,adsja2 street,adsja3 street,mumbai,Maharastra
>> 101,nivetha,HOME,a street,ad street,ads street,chennai,tn
>> 102,siva
>>
>>
>> In plain Java I have written this using recursion, but I do not know how
>> to write it in MapReduce.
>>
>> How should the code be written in MapReduce? Please help.
>>
>> Thanks in advance.
>>  Regards,
>> Ranjini R
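
The recursive flattening described above can be sketched in plain Java as follows. The attached XML is not shown in the thread, so the tag names used here (`students`, `student`, `address`, `type`, `city`) are assumptions for illustration only, and the code assumes Java 8 or later. The idea is to walk each student element recursively and collect the text of every leaf element in document order, so zero, one, or two address tags are all handled by the same code path.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class StudentFlattener {

    // Recursively appends the trimmed text of every leaf element under `node`,
    // in document order. Leaf elements with no text contribute nothing.
    static void collectLeafText(Node node, List<String> fields) {
        NodeList children = node.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasElementChild = true;
                collectLeafText(child, fields);
            }
        }
        if (!hasElementChild) {
            String text = node.getTextContent().trim();
            if (!text.isEmpty()) {
                fields.add(text);
            }
        }
    }

    // Parses a complete document and returns one comma-separated line per
    // <student> element ("student" is a hypothetical tag name).
    static List<String> flattenAll(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList students = doc.getElementsByTagName("student");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < students.getLength(); i++) {
            List<String> fields = new ArrayList<>();
            collectLeafText(students.item(i), fields);
            lines.add(String.join(",", fields));
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: one student with an address, one without.
        String xml = "<students>"
                + "<student><id>101</id><name>nivetha</name>"
                + "<address><type>HOME</type><city>chennai</city></address></student>"
                + "<student><id>102</id><name>siva</name></student>"
                + "</students>";
        for (String line : flattenAll(xml)) {
            System.out.println(line);
        }
    }
}
```

Inside a mapper, one option would be to call `flattenAll(value.toString())` and emit each returned line with `context.write(...)`. That only works if each input record holds a complete, well-formed XML document, which is not the case when a multi-line file is read line by line.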
>>
>>
>> On Fri, Jan 10, 2014 at 12:47 PM, Ranjini Rathinam <
>> ranjinibecse@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It's working fine now. The problem was in the XML: the whitespace I had put in it.
>>>
>>> Thanks a lot.
>>>
>>> Regards,
>>> Ranjini.R
>>>
>>>  On Thu, Jan 9, 2014 at 10:47 PM, Diego Gutierrez <
>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>
>>>>  Hi,
>>>>
>>>> I'm sending you the eclipse project with the code. Hope this helps.
>>>>
>>>> Regards
>>>> Diego Gutiérrez
>>>>
>>>>
>>>>
>>>> 2014/1/9 Ranjini Rathinam <ranjinibecse@gmail.com>
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using Java 1.6, Hadoop 0.20, and Ubuntu 12.04.
>>>>>
>>>>> If possible please send the jar and code for review.
>>>>>
>>>>> Thanks for the support,
>>>>>
>>>>> Ranjini
>>>>>
>>>>>  On Wed, Jan 8, 2014 at 11:00 PM, Diego Gutierrez <
>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>
>>>>>>   Hi,
>>>>>>
>>>>>> I've noticed that your XML file has line breaks. By default (with
>>>>>> TextInputFormat), Hadoop splits every input file into lines and passes
>>>>>> them to the map function; in other words, each call to the map function
>>>>>> processes one line of the file. Please remove the line breaks from your
>>>>>> XML and try again. I've tested here with your XML file (changing only
>>>>>> DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,
>>>>>> XPathConstants.NODESET) ), and this is the output in result.txt:
>>>>>>
>>>>>>
>>>>>> id,name
>>>>>> 100,ranjini,IT1,123456,nextlevel1,Chennai1Navallur1
>>>>>> 1001,ranjinikumar,IT,1234516,nextlevel,ChennaiNavallur
>>>>>>
>>>>>>
>>>>>> Note: I don't know whether the Java version or the Hadoop version could
>>>>>> be the problem here. I'm using Ubuntu 12.04, Oracle Java 7, and Hadoop 2.2.0.
>>>>>>
>>>>>>
>>>>>> If you want, I can send you the jar file with the code :)
>>>>>>
>>>>>> Regards
>>>>>> Diego Gutiérrez.
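
The line-splitting behaviour described above can be illustrated with a small standalone sketch that uses only the JDK XML parser. The class and method names (`XmlLineDemo`, `rootName`, `collapseLines`) are hypothetical. It shows that a single line of a multi-line document is either not well-formed on its own or parses with a child tag as its root, and that collapsing the line breaks yields one record that parses as intended.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlLineDemo {

    // Returns the root element name, or null if `xml` is not a well-formed document.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal parse errors as exceptions.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            return null;
        }
    }

    // Removes line breaks and the indentation around them so the whole
    // document fits on one line. Caveat: text content that itself spans
    // several lines would be joined without a separator.
    static String collapseLines(String xml) {
        return xml.replaceAll("\\s*[\\r\\n]+\\s*", "");
    }

    public static void main(String[] args) {
        String multiLine = "<Company>\n<Employee>\n<id>100</id>\n</Employee>\n</Company>\n";
        // What a line-oriented input format would hand to each map() call:
        for (String line : multiLine.split("\n")) {
            System.out.println(line + " -> root: " + rootName(line));
        }
        String oneLine = collapseLines(multiLine);
        System.out.println(oneLine + " -> root: " + rootName(oneLine));
    }
}
```

Note that a line such as `<id>100</id>` is a well-formed document by itself with root `id`, which may explain a child element showing up as the root element inside the mapper.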
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2014/1/7 Ranjini Rathinam <ranjinibecse@gmail.com>
>>>>>>
>>>>>>> Hi Gutierrez,
>>>>>>>
>>>>>>> As suggested, I tried the code, but result.txt contains only the
>>>>>>> header. Nothing else was printed.
>>>>>>>
>>>>>>> After debugging, I found that there is no value while parsing.
>>>>>>>
>>>>>>> The problem is in the line marked in bold below. When I added a
>>>>>>> System.out call, no value was printed at this line.
>>>>>>>
>>>>>>>  String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new
>>>>>>> ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory =
>>>>>>> DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>
>>>>>>> * Document doc = builder.parse(is);*
>>>>>>>    String ed=doc.getDocumentElement().getNodeName();
>>>>>>>    out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList)
>>>>>>> getNode("/Company/Employee", doc,XPathConstants.NODESET);
>>>>>>>
>>>>>>> When I print
>>>>>>>
>>>>>>> out.write(xmlContent.getBytes()): the whole XML is printed.
>>>>>>>
>>>>>>> Then I added a System.out call for list, and nothing was printed.
>>>>>>> out.write(ed.getBytes()): nothing is printed.
>>>>>>>
>>>>>>> Please suggest where I am going wrong and help me fix this.
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> I have attached my code. Please review.
>>>>>>>
>>>>>>>
>>>>>>> Mapper class:-
>>>>>>>
>>>>>>> public class XmlTextMapper extends Mapper<LongWritable, Text, Text, Text> {
>>>>>>>     private static final XPathFactory xpathFactory = XPathFactory.newInstance();
>>>>>>>
>>>>>>>     @Override
>>>>>>>     public void map(LongWritable key, Text value, Context context)
>>>>>>>             throws IOException, InterruptedException {
>>>>>>>         String resultFileName = "/user/task/Sales/result.txt";
>>>>>>>
>>>>>>>         Configuration conf = new Configuration();
>>>>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>>>>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>>>>         String header = "id,name\n";
>>>>>>>         out.write(header.getBytes());
>>>>>>>         String xmlContent = value.toString();
>>>>>>>
>>>>>>>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
>>>>>>>         DocumentBuilder builder;
>>>>>>>         try {
>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>             Document doc = builder.parse(is);
>>>>>>>             String ed = doc.getDocumentElement().getNodeName();
>>>>>>>             out.write(ed.getBytes());
>>>>>>>             DTMNodeList list = (DTMNodeList) getNode("/Company/Employee", doc,
>>>>>>>                     XPathConstants.NODESET);
>>>>>>>             int size = list.getLength();
>>>>>>>             for (int i = 0; i < size; i++) {
>>>>>>>                 Node node = list.item(i);
>>>>>>>                 String line = "";
>>>>>>>                 NodeList nodeList = node.getChildNodes();
>>>>>>>                 int childNumber = nodeList.getLength();
>>>>>>>                 for (int j = 0; j < childNumber; j++) {
>>>>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>>>>                 }
>>>>>>>                 if (line.endsWith(","))
>>>>>>>                     line = line.substring(0, line.length() - 1);
>>>>>>>                 line += "\n";
>>>>>>>                 out.write(line.getBytes());
>>>>>>>             }
>>>>>>>         } catch (ParserConfigurationException e) {
>>>>>>>             e.printStackTrace();
>>>>>>>         } catch (SAXException e) {
>>>>>>>             e.printStackTrace();
>>>>>>>         } catch (XPathExpressionException e) {
>>>>>>>             e.printStackTrace();
>>>>>>>         }
>>>>>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>>>>         out.close();
>>>>>>>     }
>>>>>>>
>>>>>>>     public static Object getNode(String xpathStr, Node node, QName retunType)
>>>>>>>             throws XPathExpressionException {
>>>>>>>         XPath xpath = xpathFactory.newXPath();
>>>>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Main class
>>>>>>> public class MainXml {
>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>         Configuration conf = new Configuration();
>>>>>>>         if (args.length != 2) {
>>>>>>>             System.err.println("Usage: XMLtoText <input path> <output path>");
>>>>>>>             System.exit(-1);
>>>>>>>         }
>>>>>>>         String output = "/user/task/Sales/";
>>>>>>>         Job job = new Job(conf, "XML to Text");
>>>>>>>         job.setJarByClass(MainXml.class);
>>>>>>>         // job.setJobName("XML to Text");
>>>>>>>
>>>>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>>>>         // FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>>>>         Path outPath = new Path(output);
>>>>>>>         FileOutputFormat.setOutputPath(job, outPath);
>>>>>>>         FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
>>>>>>>         if (dfs.exists(outPath)) {
>>>>>>>             dfs.delete(outPath, true);
>>>>>>>         }
>>>>>>>         job.setMapperClass(XmlTextMapper.class);
>>>>>>>
>>>>>>>         job.setNumReduceTasks(0);
>>>>>>>         job.setMapOutputKeyClass(Text.class);
>>>>>>>         job.setMapOutputValueClass(Text.class);
>>>>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> My xml file
>>>>>>>
>>>>>>> <Company>
>>>>>>> <Employee>
>>>>>>> <id>100</id>
>>>>>>> <ename>ranjini</ename>
>>>>>>> <dept>IT1</dept>
>>>>>>> <sal>123456</sal>
>>>>>>> <location>nextlevel1</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai1</Home>
>>>>>>> <Office>Navallur1</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> <Employee>
>>>>>>> <id>1001</id>
>>>>>>> <ename>ranjinikumar</ename>
>>>>>>> <dept>IT</dept>
>>>>>>> <sal>1234516</sal>
>>>>>>> <location>nextlevel</location>
>>>>>>> <Address>
>>>>>>> <Home>Chennai</Home>
>>>>>>> <Office>Navallur</Office>
>>>>>>> </Address>
>>>>>>> </Employee>
>>>>>>> </Company>
>>>>>>>
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>> Ranjini
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>  On Mon, Jan 6, 2014 at 2:44 PM, Ranjini Rathinam <
>>>>>>>> ranjinibecse@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thanks a lot .
>>>>>>>>>
>>>>>>>>> Ranjini
>>>>>>>>>
>>>>>>>>> On Fri, Jan 3, 2014 at 10:40 PM, Diego Gutierrez <
>>>>>>>>> diego.gutierrez@ucsp.edu.pe> wrote:
>>>>>>>>>
>>>>>>>>>>  Hi,
>>>>>>>>>>
>>>>>>>>>> I suggest using XPath, which Java supports natively for querying
>>>>>>>>>> XML.
>>>>>>>>>>
>>>>>>>>>> For the main problem, as with the distcp command
>>>>>>>>>> (http://hadoop.apache.org/docs/r0.19.0/distcp.pdf), there is no
>>>>>>>>>> need for a reduce function, because you can parse the XML input file
>>>>>>>>>> and create the file you need in the map function. For example, the
>>>>>>>>>> following code reads an XML file in HDFS, parses it, and creates a
>>>>>>>>>> new file ("/result.txt") in the expected format:
>>>>>>>>>> id,name
>>>>>>>>>> 100,RR
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Mapper function:
>>>>>>>>>>
>>>>>>>>>> import java.io.ByteArrayInputStream;
>>>>>>>>>> import java.io.IOException;
>>>>>>>>>> import java.io.InputStream;
>>>>>>>>>> import java.net.URI;
>>>>>>>>>>
>>>>>>>>>> import javax.xml.namespace.QName;
>>>>>>>>>> import javax.xml.parsers.DocumentBuilder;
>>>>>>>>>> import javax.xml.parsers.DocumentBuilderFactory;
>>>>>>>>>> import javax.xml.parsers.ParserConfigurationException;
>>>>>>>>>> import javax.xml.xpath.XPath;
>>>>>>>>>> import javax.xml.xpath.XPathConstants;
>>>>>>>>>> import javax.xml.xpath.XPathExpressionException;
>>>>>>>>>> import javax.xml.xpath.XPathFactory;
>>>>>>>>>>
>>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>>> import org.apache.hadoop.fs.FSDataOutputStream;
>>>>>>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>> import org.apache.hadoop.io.IOUtils;
>>>>>>>>>> import org.apache.hadoop.io.LongWritable;
>>>>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>>>>>>>> import org.w3c.dom.Document;
>>>>>>>>>> import org.w3c.dom.Node;
>>>>>>>>>> import org.w3c.dom.NodeList;
>>>>>>>>>> import org.xml.sax.SAXException;
>>>>>>>>>>
>>>>>>>>>> import com.sun.org.apache.xml.internal.dtm.ref.DTMNodeList;
>>>>>>>>>>
>>>>>>>>>> public class XmlToTextMapper extends Mapper<LongWritable, Text, Text, Text> {
>>>>>>>>>>
>>>>>>>>>>     private static final XPathFactory xpathFactory = XPathFactory.newInstance();
>>>>>>>>>>
>>>>>>>>>>     @Override
>>>>>>>>>>     public void map(LongWritable key, Text value, Context context)
>>>>>>>>>>             throws IOException, InterruptedException {
>>>>>>>>>>
>>>>>>>>>>         String resultFileName = "/result.txt";
>>>>>>>>>>
>>>>>>>>>>         Configuration conf = new Configuration();
>>>>>>>>>>         FileSystem fs = FileSystem.get(URI.create(resultFileName), conf);
>>>>>>>>>>         FSDataOutputStream out = fs.create(new Path(resultFileName));
>>>>>>>>>>
>>>>>>>>>>         InputStream resultIS = new ByteArrayInputStream(new byte[0]);
>>>>>>>>>>
>>>>>>>>>>         String header = "id,name\n";
>>>>>>>>>>         out.write(header.getBytes());
>>>>>>>>>>
>>>>>>>>>>         String xmlContent = value.toString();
>>>>>>>>>>         InputStream is = new ByteArrayInputStream(xmlContent.getBytes());
>>>>>>>>>>         DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
>>>>>>>>>>         DocumentBuilder builder;
>>>>>>>>>>         try {
>>>>>>>>>>             builder = factory.newDocumentBuilder();
>>>>>>>>>>             Document doc = builder.parse(is);
>>>>>>>>>>             DTMNodeList list = (DTMNodeList) getNode("/main/data", doc,
>>>>>>>>>>                     XPathConstants.NODESET);
>>>>>>>>>>
>>>>>>>>>>             int size = list.getLength();
>>>>>>>>>>             for (int i = 0; i < size; i++) {
>>>>>>>>>>                 Node node = list.item(i);
>>>>>>>>>>                 String line = "";
>>>>>>>>>>                 NodeList nodeList = node.getChildNodes();
>>>>>>>>>>                 int childNumber = nodeList.getLength();
>>>>>>>>>>                 for (int j = 0; j < childNumber; j++) {
>>>>>>>>>>                     line += nodeList.item(j).getTextContent() + ",";
>>>>>>>>>>                 }
>>>>>>>>>>                 if (line.endsWith(","))
>>>>>>>>>>                     line = line.substring(0, line.length() - 1);
>>>>>>>>>>                 line += "\n";
>>>>>>>>>>                 out.write(line.getBytes());
>>>>>>>>>>
>>>>>>>>>>             }
>>>>>>>>>>
>>>>>>>>>>         } catch (ParserConfigurationException e) {
>>>>>>>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         } catch (SAXException e) {
>>>>>>>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         } catch (XPathExpressionException e) {
>>>>>>>>>>             MyLogguer.log("error: " + e.getMessage());
>>>>>>>>>>             e.printStackTrace();
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         IOUtils.copyBytes(resultIS, out, 4096, true);
>>>>>>>>>>         out.close();
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>>     public static Object getNode(String xpathStr, Node node, QName retunType)
>>>>>>>>>>             throws XPathExpressionException {
>>>>>>>>>>         XPath xpath = xpathFactory.newXPath();
>>>>>>>>>>         return xpath.evaluate(xpathStr, node, retunType);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --------------------------------------
>>>>>>>>>> Main class:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>>>>> import org.apache.hadoop.mapreduce.Job;
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>>>>>>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>>>>>>>>>
>>>>>>>>>> public class Main {
>>>>>>>>>>
>>>>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>>>>
>>>>>>>>>>         if (args.length != 2) {
>>>>>>>>>>             System.err.println("Usage: XMLtoText <input path> <output path>");
>>>>>>>>>>             System.exit(-1);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         Job job = new Job();
>>>>>>>>>>         job.setJarByClass(Main.class);
>>>>>>>>>>         job.setJobName("XML to Text");
>>>>>>>>>>         FileInputFormat.addInputPath(job, new Path(args[0]));
>>>>>>>>>>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>>>>>>>>>>
>>>>>>>>>>         // Map-only job: the mapper does all the work, so no reducers.
>>>>>>>>>>         job.setMapperClass(XmlToTextMapper.class);
>>>>>>>>>>         job.setNumReduceTasks(0);
>>>>>>>>>>         job.setMapOutputKeyClass(Text.class);
>>>>>>>>>>         job.setMapOutputValueClass(Text.class);
>>>>>>>>>>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>>>>>>
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> To execute the job you can use :
>>>>>>>>>>
>>>>>>>>>>          bin/hadoop Main /data.xml /output.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Then you can use this to see result.txt file:
>>>>>>>>>>
>>>>>>>>>>           hadoop fs -cat /result.txt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'm using this xml as input:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> <main><data><id>1</id><name>NameA</name></data><data><id>2</id><name>NameB</name></data></main>
>>>>>>>>>>
>>>>>>>>>> and the content in result.txt is like this:
>>>>>>>>>>
>>>>>>>>>> id,name
>>>>>>>>>> 1,NameA
>>>>>>>>>> 2,NameB
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hope this helps.
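
The XPath approach above can also be sketched with the standard `javax.xml.xpath` and `org.w3c.dom` APIs only, casting the result to `org.w3c.dom.NodeList` instead of the JDK-internal `DTMNodeList` class. This is a minimal standalone sketch, not the code from the thread: the class name `XPathCsv`, the method `rows`, and the sample document are assumptions, and it requires Java 8 or later.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathCsv {

    // Selects record elements with an XPath expression and returns one
    // comma-separated row per record, built from its direct child elements.
    static List<String> rows(String xml, String recordPath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // For XPathConstants.NODESET the result is an org.w3c.dom.NodeList.
        NodeList records =
                (NodeList) xpath.evaluate(recordPath, doc, XPathConstants.NODESET);

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < records.getLength(); i++) {
            NodeList children = records.item(i).getChildNodes();
            List<String> fields = new ArrayList<>();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Skip whitespace-only text nodes between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.add(child.getTextContent().trim());
                }
            }
            rows.add(String.join(",", fields));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical single-line input matching the "/main/data" expression.
        String xml = "<main><data><id>1</id><name>NameA</name></data>"
                + "<data><id>2</id><name>NameB</name></data></main>";
        for (String row : rows(xml, "/main/data")) {
            System.out.println(row);
        }
    }
}
```

Because `getTextContent()` concatenates the text of nested elements, a child with its own children yields a single joined field. In a mapper, emitting each row with `context.write(...)` may be preferable to calling `fs.create(...)` inside `map()`, since `FileSystem.create(Path)` overwrites an existing file by default and `map()` runs once per input record.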
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014/1/3 Ranjini Rathinam <ranjinibecse@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I need to convert XML into text using MapReduce.
>>>>>>>>>>>
>>>>>>>>>>> I have tried both the DOM and SAX parsers.
>>>>>>>>>>>
>>>>>>>>>>> When I use SAX Builder in the mapper class, a child node acts as
>>>>>>>>>>> the root element.
>>>>>>>>>>>
>>>>>>>>>>> Looking at the System.out output, I found that the root element
>>>>>>>>>>> being printed is actually a child element.
>>>>>>>>>>>
>>>>>>>>>>> For example:
>>>>>>>>>>>
>>>>>>>>>>> <Comp><Emp><id>100</id><name>RR</name></Emp></Comp>
>>>>>>>>>>>
>>>>>>>>>>> When this XML is passed to the mapper and I print the root element,
>>>>>>>>>>> I get the following as the root element:
>>>>>>>>>>>
>>>>>>>>>>> <id>
>>>>>>>>>>> <name>
>>>>>>>>>>>
>>>>>>>>>>> Please suggest how to fix this.
>>>>>>>>>>>
>>>>>>>>>>> I need to convert the XML into text using MapReduce code. Please
>>>>>>>>>>> provide an example.
>>>>>>>>>>>
>>>>>>>>>>> Required output is
>>>>>>>>>>>
>>>>>>>>>>> id,name
>>>>>>>>>>> 100,RR
>>>>>>>>>>>
>>>>>>>>>>> Please help.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance,
>>>>>>>>>>> Ranjini R
