hadoop-common-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: Processing small xml files
Date Sun, 12 Feb 2012 20:30:10 GMT
On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill <billmcn@gmail.com> wrote:
> I've used the Mahout XMLInputFormat. It is the right tool if you have an
> XML file with one type of section repeated over and over again and want to
> turn that into a sequence file where each repeated section is a value. I've
> found it helpful as a preprocessing step for converting raw XML input into
> something that can be handled by Hadoop jobs.

Thanks for the input.

Do you first convert it into a flat format and then run another Hadoop
job, or do you just read the XML sequence file and then perform the
reduce on that? Is there an advantage to first converting it into a flat
file format?
>
> If you're worried about having lots of small files--specifically, about
> overwhelming your namenode because you have too many small
> files--the XMLInputFormat won't help with that. However, it may be possible
> to concatenate the small files into larger files, then have a Hadoop job
> that uses XMLInputFormat transform the large files into sequence files.
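
For reference, the concatenation step suggested above could be sketched roughly as follows. This is only an illustration run outside Hadoop, not the Mahout tooling itself: the function name `concat_xml` and the wrapper element `records` are hypothetical, and it assumes each small file is UTF-8 and holds one well-formed document with at most a leading XML declaration.

```python
import re
from pathlib import Path

# Matches a leading XML declaration such as <?xml version="1.0"?>
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>\s*")

def concat_xml(paths, out_path, root_tag="records"):
    """Write the bodies of many small XML files into one file under a single root.

    Hypothetical helper: root_tag is an arbitrary wrapper element chosen for
    this sketch. Returns the number of documents written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"<{root_tag}>\n")
        for p in paths:
            text = Path(p).read_text(encoding="utf-8")
            # Drop the per-file declaration so the combined file stays well-formed.
            body = XML_DECL.sub("", text, count=1).strip()
            if body:
                out.write(body + "\n")
                count += 1
        out.write(f"</{root_tag}>\n")
    return count
```

The combined file then contains one repeated section per original document, which is the shape of input described above for XMLInputFormat.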

How many files are too many for the namenode? We have around 100M files
and add about 100M more every year.
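
To put rough numbers on the question, here is a back-of-the-envelope sketch. It assumes the commonly quoted rule of thumb of about 150 bytes of namenode heap per namespace object (file or block), and one block per small file. Both are assumptions, not measurements from any cluster, and the function name is made up for illustration.

```python
# Rule-of-thumb assumption: ~150 bytes of namenode heap per namespace object.
BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files, blocks_per_file=1):
    """Estimate namenode heap in GB for num_files (hypothetical helper).

    Each file contributes one file object plus blocks_per_file block objects;
    directories are ignored in this sketch.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9

# 100M files today, growing by 100M files per year.
for year in range(4):
    files = 100_000_000 * (1 + year)
    print(f"year {year}: {files:,} files, ~{namenode_heap_gb(files):.0f} GB heap")
```

Under these assumptions the estimate grows linearly with the file count, so packing many small files into fewer large ones reduces it in direct proportion.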
