hadoop-common-user mailing list archives

From Andrzej Jan Taramina <andr...@chaeron.com>
Subject Processing a large quantity of smaller XML files?
Date Mon, 14 Sep 2009 13:40:39 GMT
I'm new to Hadoop, so pardon the potentially dumb question....

I've gathered, from much research, that Hadoop is not always a good choice when you need to process a whack of smaller files, which is exactly what we need to do.

More specifically, we need to start by processing about 250K XML files, each of which is in the 50K - 2M range, with an average size of 100K bytes. The processing we need to do on each file is pretty CPU-intensive, with a lot of pattern matching, and it would fall nicely into the Map/Reduce paradigm. Over time, the volume of files will grow by an order of magnitude, into the range of millions of files, hence the desire to use a mapred distributed cluster to do the analysis we need.
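
For reference, the kind of per-document processing I have in mind looks roughly like the mapper below. It's only a sketch: it assumes each map() call receives one whole XML document as the value, keyed by the source file name (e.g. read out of a SequenceFile), and the regex is just a stand-in for our real pattern matching.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: key = source file name, value = full XML document text.
// Emits (pattern name, match count) per document; a reducer could sum the counts.
public class XmlPatternMapper extends Mapper<Text, Text, Text, IntWritable> {

    // Placeholder pattern standing in for the real CPU-intensive matching logic.
    private static final Pattern EXAMPLE = Pattern.compile("<title>(.*?)</title>");

    private final Text outKey = new Text("title-elements");
    private final IntWritable outValue = new IntWritable();

    @Override
    protected void map(Text fileName, Text xmlDocument, Context context)
            throws IOException, InterruptedException {
        Matcher m = EXAMPLE.matcher(xmlDocument.toString());
        int matches = 0;
        while (m.find()) {
            matches++;
        }
        outValue.set(matches);
        context.write(outKey, outValue);
    }
}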

Normally, one could just concatenate the XML files into bigger input files. Unfortunately, one of our constraints is that a certain percentage of these XML files will change every night, and so we need to be able to update the Hadoop data store (HDFS perhaps) on a regular basis. This would be difficult if the files are all concatenated.
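
The closest approach I've found so far is packing the small files into SequenceFiles, keyed by original file name, roughly like the sketch below. This is just a sketch under my own assumptions: the paths are made up, and it supposes one SequenceFile per batch of input files.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs a local directory of small XML files into one SequenceFile on HDFS,
// keyed by file name so each document can still be identified later.
public class XmlPacker {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf); // source: local directory of XML files
        FileSystem hdfs = FileSystem.get(conf);         // destination: HDFS

        Path inputDir = new Path(args[0]);   // e.g. /exports/xml/2009-09-14 (hypothetical)
        Path packedFile = new Path(args[1]); // e.g. /data/xml-packed/2009-09-14.seq (hypothetical)

        SequenceFile.Writer writer = SequenceFile.createWriter(
                hdfs, conf, packedFile, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : localFs.listStatus(inputDir)) {
                if (status.isDir()) {
                    continue;
                }
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = localFs.open(status.getPath());
                try {
                    in.readFully(contents);
                } finally {
                    IOUtils.closeStream(in);
                }
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}

The obvious downside is that a single changed document means rewriting whatever container file it landed in, which is part of what I'm hoping there's a better pattern for.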

The XML data originally comes from a number of XML databases.

Any advice/suggestions on the best way to structure our data storage of all the XML files so that Hadoop would run efficiently and we could thus use Map/Reduce on a Hadoop cluster, yet still conveniently update the changed files on a nightly basis?

Much appreciated!

-- 
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com
