hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: MapReduce Job on XML input
Date Mon, 10 Dec 2007 09:24:50 GMT

On Mon, Dec 10, 2007 at 01:12:28AM -0800, Alan Ho wrote:
>I've written an XML input splitter based on a StAX parser. It's much better than StreamXmlRecordReader.

We'd definitely like to see something like this in Hadoop; would you mind contributing it?

Details: http://wiki.apache.org/lucene-hadoop/HowToContribute
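
Alan's StAX-based reader itself is not included in this thread, but a minimal sketch of the idea, written against the old org.apache.hadoop.mapred API of that era, could look like the following. The <record> element name, the choice of record index as key and element text as value, and the class name StaxXmlRecordReader are illustrative assumptions, not code from this list; it also assumes each small, whole file is handed to the reader as a single split (e.g. from a FileInputFormat subclass whose isSplitable() returns false).

import java.io.IOException;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;

/** Illustrative sketch only: pulls one <record> element per call to next(). */
public class StaxXmlRecordReader implements RecordReader<LongWritable, Text> {

  private final FSDataInputStream in;
  private final XMLStreamReader xml;
  private final long length;
  private long recordCount = 0;

  public StaxXmlRecordReader(JobConf conf, FileSplit split) throws IOException {
    Path file = split.getPath();
    FileSystem fs = file.getFileSystem(conf);
    in = fs.open(file);
    length = split.getLength();
    try {
      xml = XMLInputFactory.newInstance().createXMLStreamReader(in);
    } catch (XMLStreamException e) {
      throw new IOException("Cannot open XML stream: " + e.getMessage());
    }
  }

  /** Advance to the next <record> element and put its text into 'value'. */
  public boolean next(LongWritable key, Text value) throws IOException {
    try {
      while (xml.hasNext()) {
        if (xml.next() == XMLStreamConstants.START_ELEMENT
            && "record".equals(xml.getLocalName())) {
          key.set(recordCount++);
          value.set(readElementText());
          return true;
        }
      }
    } catch (XMLStreamException e) {
      throw new IOException("XML parse error: " + e.getMessage());
    }
    return false;
  }

  /** Collect the character data inside the current element (text only). */
  private String readElementText() throws XMLStreamException {
    StringBuilder sb = new StringBuilder();
    int depth = 1;
    while (depth > 0 && xml.hasNext()) {
      int event = xml.next();
      if (event == XMLStreamConstants.START_ELEMENT) depth++;
      else if (event == XMLStreamConstants.END_ELEMENT) depth--;
      else if (event == XMLStreamConstants.CHARACTERS) sb.append(xml.getText());
    }
    return sb.toString();
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() throws IOException { return in.getPos(); }
  public float getProgress() throws IOException {
    return length == 0 ? 1.0f : Math.min(1.0f, getPos() / (float) length);
  }
  public void close() throws IOException { in.close(); }
}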


>----- Original Message ----
>From: Peter Thygesen <thygesen@infopaq.dk>
>To: hadoop-user@lucene.apache.org
>Sent: Monday, November 26, 2007 8:49:52 AM
>Subject: MapReduce Job on XML input
>I would like to run some MapReduce jobs on some XML files I have (approx.
>100,000 compressed files).
>The XML files are not that big, about 1 MB compressed, each containing
>about 1000 records.
>Do I have to write my own InputSplitter? Should I use
>MultiFileInputFormat or StreamInputFormat? Can I use the
>StreamXmlRecordReader, and how? By sub-classing some input class?
>The tutorials and examples I've read are all very straightforward,
>reading simple text files, but I miss a more complex example, especially
>one that reads XML files ;)
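
On the StreamXmlRecordReader part of the question: it can be used without subclassing anything by pointing StreamInputFormat at it from a regular Java job. A hedged sketch is below; the stream.recordreader.* property names come from the streaming contrib package, the <record>/</record> tags and the args[0]/args[1] paths are placeholders, and exact method names vary a little between Hadoop versions, so treat it as an outline rather than a recipe.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlRecordsDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(XmlRecordsDriver.class);
    conf.setJobName("xml-records");

    // Hand each <record>...</record> block to the map function as one value.
    conf.setInputFormat(StreamInputFormat.class);
    conf.set("stream.recordreader.class",
             "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<record>");
    conf.set("stream.recordreader.end", "</record>");

    FileInputFormat.addInputPath(conf, new Path(args[0]));      // input dir
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));    // output dir

    // conf.setMapperClass(...);  // your mapper and reducer go here as usual
    JobClient.runJob(conf);
  }
}

One practical note on the input itself: if the files are gzip-compressed they are not splittable, so each ~1 MB file will be read whole by a single reader regardless of the format chosen, and MultiFileInputFormat mainly helps by packing many such small files into fewer map tasks.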
