hadoop-common-user mailing list archives

From Alan Ho <karlu...@yahoo.ca>
Subject Re: MapReduce Job on XML input
Date Mon, 10 Dec 2007 09:12:28 GMT
I've written an XML input splitter based on a StAX parser. It's much better than StreamXmlRecordReader.
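
The splitter itself is not attached to this message, so what follows is only a minimal sketch of the general idea and not the actual implementation. It uses the standard javax.xml.stream (StAX) API to walk an XML stream and emit each record element as its own string. The class name StaxRecordSplitter, the method split, and the element name "record" are all assumptions made for illustration. Wiring this into a Hadoop InputFormat/RecordReader is not shown.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: splits one XML document into record-sized chunks.
public class StaxRecordSplitter {

    // Returns every element named recordTag, serialized back to XML text.
    public static List<String> split(Reader in, String recordTag) {
        List<String> records = new ArrayList<>();
        try {
            XMLInputFactory inFactory = XMLInputFactory.newInstance();
            // Do not resolve DTDs or external entities in untrusted input.
            inFactory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            inFactory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();

            XMLEventReader reader = inFactory.createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                boolean isRecordStart = event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(recordTag);
                if (!isRecordStart) {
                    continue; // skip everything outside a record
                }
                // Copy events until the matching end tag, tracking nesting depth.
                StringWriter buffer = new StringWriter();
                XMLEventWriter writer = outFactory.createXMLEventWriter(buffer);
                writer.add(event);
                int depth = 1;
                while (depth > 0) {
                    XMLEvent inner = reader.nextEvent();
                    if (inner.isStartElement()) {
                        depth++;
                    } else if (inner.isEndElement()) {
                        depth--;
                    }
                    writer.add(inner);
                }
                writer.flush();
                writer.close();
                records.add(buffer.toString());
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Rethrown unchecked to keep the sketch's call sites simple.
            throw new IllegalArgumentException("Malformed XML input", e);
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<root><record><id>1</id></record><record><id>2</id></record></root>";
        for (String record : split(new StringReader(xml), "record")) {
            System.out.println(record);
        }
    }
}
```

Since each file in the original question is only about 1 MB compressed, one plausible design is to treat every file as a single unsplittable unit and run a parser like this over the decompressed stream, so a record never straddles a split boundary.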

----- Original Message ----
From: Peter Thygesen <thygesen@infopaq.dk>
To: hadoop-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:49:52 AM
Subject: MapReduce Job on XML input

I would like to run some MapReduce jobs on some XML files I have (approx.
100,000 compressed files).
The XML files are not that big, about 1 MB compressed, each containing
about 1,000 records.

Do I have to write my own InputSplitter? Should I use
MultiFileInputFormat or StreamInputFormat? Can I use the
StreamXmlRecordReader, and how? By sub-classing some input class?

The tutorials and examples I've read are all very straightforward,
reading simple text files, but I am missing a more complex example,
one that reads XML files ;)

