Subject: Re: Processing small xml files
From: Mohit Anchlia
To: common-user@hadoop.apache.org
Date: Sun, 12 Feb 2012 12:30:10 -0800

On Sun, Feb 12, 2012 at 9:24 AM, W.P. McNeill wrote:

> I've used the Mahout XMLInputFormat. It is the right tool if you have an
> XML file with one type of section repeated over and over again and want
> to turn that into a SequenceFile where each repeated section is a value.
> I've found it helpful as a preprocessing step for converting raw XML
> input into something that can be handled by Hadoop jobs.

Thanks for the input. Do you first convert it into a flat format and then
run another Hadoop job, or do you just read the XML sequence file and
perform the reduce on that? Is there an advantage to first converting it
into a flat file format?

> If you're worried about having lots of small files--specifically, about
> overwhelming your namenode because you have too many small files--the
> XMLInputFormat won't help with that. However, it may be possible to
> concatenate the small files into larger files, then have a Hadoop job
> that uses XMLInputFormat to transform the large files into sequence
> files.

How many are too many for the namenode? We have around 100M files, and
another 100M every year.
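
In case it helps anyone following the thread, below is a rough sketch of
the kind of job W.P. is describing: read concatenated XML with Mahout's
XmlInputFormat and write a SequenceFile where each value is one repeated
XML section. It assumes the XmlInputFormat that shipped in the Mahout 0.x
releases (org.apache.mahout.classifier.bayes.XmlInputFormat, configured via
the xmlinput.start / xmlinput.end properties) and the new mapreduce API.
The <record> tags and the class names are placeholders -- adjust them for
whatever element repeats in your files.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.mahout.classifier.bayes.XmlInputFormat;

    public class XmlToSequenceFile {

      // Pass-through mapper: XmlInputFormat hands us the text between the
      // configured start/end tags as the value; we just write it out again.
      public static class XmlMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          context.write(key, value);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell XmlInputFormat which tags delimit one record.
        // "<record>" / "</record>" are placeholders for your own element.
        conf.set("xmlinput.start", "<record>");
        conf.set("xmlinput.end", "</record>");

        Job job = new Job(conf, "xml to sequence file");
        job.setJarByClass(XmlToSequenceFile.class);
        job.setInputFormatClass(XmlInputFormat.class);
        job.setMapperClass(XmlMapper.class);
        job.setNumReduceTasks(0);   // map-only: just repackage the records
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Keeping it map-only means the output is just a handful of large
SequenceFiles, so it also addresses the small-files side of the question:
downstream jobs read the SequenceFiles instead of millions of tiny XML
files. As a rule of thumb from the Hadoop docs, every file, directory and
block costs on the order of 150 bytes of namenode heap, so 100M files (and
growing) is exactly the range where packing them into larger files first
pays off.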