From: Andrzej Jan Taramina
Organization: Chaeron Corporation
To: common-user@hadoop.apache.org
Date: Mon, 14 Sep 2009 09:40:39 -0400
Subject: Processing a large quantity of smaller XML files?

I'm new to Hadoop, so pardon the potentially dumb question....

I've gathered, from much research, that Hadoop is not always a good choice when
you need to process a whack of smaller files, which is what we need to do. More
specifically, we need to start by processing about 250K XML files, each in the
50 KB - 2 MB range, with an average size of about 100 KB. The processing we
need to do on each file is pretty CPU-intensive, with a lot of pattern
matching, and it would fall nicely into the Map/Reduce paradigm. Over time, the
volume of files will grow by an order of magnitude, into the millions, hence
the desire to use a distributed MapReduce cluster to do the analysis we need.

Normally, one could just concatenate the XML files into bigger input files.
Unfortunately, one of our constraints is that a certain percentage of these XML
files will change every night, so we need to be able to update the Hadoop data
store (HDFS, perhaps) on a regular basis. That would be difficult if the files
were all concatenated. The XML data originally comes from a number of XML
databases.

Any advice or suggestions on the best way to structure our storage of all the
XML files so that Hadoop runs efficiently and we can use Map/Reduce on a Hadoop
cluster, yet still conveniently update the changed files on a nightly basis?

Much appreciated!
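
P.S. For concreteness, here's roughly the packing step I was picturing, based
on what I've read so far: write each small XML file as one record into a big
SequenceFile on HDFS, keyed by file name. This is only a sketch against the
0.20-era API; the paths and class names are made up, and I may well be holding
it wrong -- corrections welcome.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch only: pack a local directory of small XML files into a single
// SequenceFile on HDFS, one record per file, keyed by file name.
public class PackXmlFiles {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        File srcDir = new File(args[0]);   // local directory of XML files
        Path out = new Path(args[1]);      // e.g. /xml/packed/2009-09-14.seq (made-up path)

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
        try {
            for (File xml : srcDir.listFiles()) {
                writer.append(new Text(xml.getName()), new Text(readFile(xml)));
            }
        } finally {
            writer.close();
        }
    }

    // Read a whole (smallish) file into a UTF-8 string.
    private static String readFile(File f) throws IOException {
        byte[] buf = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
            in.readFully(buf);
        } finally {
            in.close();
        }
        return new String(buf, "UTF-8");
    }
}

The nightly-update problem is the part this doesn't solve: if one file changes,
the whole packed SequenceFile has to be rewritten, which is exactly why I'm
asking about storage layout.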
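
The analysis side I'm imagining would then just be a mapper over those (file
name, XML contents) records, read back in with SequenceFileInputFormat --
something like the purely illustrative sketch below, where the pattern and the
output are placeholders for the real matching we need to do:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: each input record is (file name, complete XML document), as
// produced by the packing step above, so one map() call sees a whole file
// and can do the CPU-heavy pattern matching on it.
public class XmlPatternMapper extends Mapper<Text, Text, Text, IntWritable> {

    // Placeholder pattern; the real job would do much more involved matching.
    private static final Pattern PATTERN =
        Pattern.compile("<status>(ERROR|WARN)</status>");

    @Override
    protected void map(Text fileName, Text xmlContents, Context context)
            throws IOException, InterruptedException {
        Matcher m = PATTERN.matcher(xmlContents.toString());
        int hits = 0;
        while (m.find()) {
            hits++;
        }
        context.write(fileName, new IntWritable(hits));   // matches per file
    }
}

Does that general shape make sense, or is there a better-suited input format or
storage layout given the nightly churn?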
-- 
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com