hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: MapReduce Job on XML input
Date Mon, 26 Nov 2007 18:02:09 GMT

That isn't all that many files.  At 1 MB each, you shouldn't see much of a
performance hit from reading many files.

You will need a special input format, but it can be very simple.  Just extend
something like TextInputFormat, replace the record reader, and report the
files as unsplittable, so each file becomes a single split handled whole by
one map task.

On 11/26/07 8:49 AM, "Peter Thygesen" <thygesen@infopaq.dk> wrote:

> I would like to run some MapReduce jobs on some XML files I've got (approx.
> 100000 compressed files).
> The XML files are not that big, about 1 MB compressed, each containing
> about 1000 records.
> Do I have to write my own InputSplit? Should I use
> MultiFileInputFormat or StreamInputFormat? Can I use the
> StreamXmlRecordReader, and how? By sub-classing some input class?
> The tutorials and examples I've read are all very straightforward,
> reading simple text files, but I'm missing a more complex example, especially
> one that reads XML files ;)
> thx. 
> Peter
