hadoop-common-user mailing list archives

From brien colwell <xcolw...@gmail.com>
Subject Re: Processing a large quantity of smaller XML files?
Date Mon, 14 Sep 2009 14:13:17 GMT
We have a similar setup with many smaller files. We found that using a
combined file input (64 MB of files bundled as input to a single map
task), in conjunction with compressed files on HDFS, greatly reduced the
per-task overhead of the map jobs. I haven't compared distributing ZIP/TAR
files on HDFS against using the combined file input ... I assume the
combined file input is efficient, but I haven't measured it against
explicitly bundling the input files.
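
For reference, here's a minimal driver sketch (not from this thread) of the
"combined file input" idea, using the newer MapReduce API's
CombineTextInputFormat (a ready-made CombineFileInputFormat subclass available
in later Hadoop releases). The class names CombinedSmallFilesJob and
XmlLineMapper, the 64 MB split cap, and the line-oriented mapper are
illustrative assumptions; for whole XML documents you'd more likely plug in a
record reader that hands each file to the mapper in one piece.

    // Sketch only: packs many small input files into ~64 MB splits so one
    // map task processes a bundle of files instead of a single tiny file.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinedSmallFilesJob {

        // Placeholder mapper: real code would do the CPU-intensive
        // pattern matching on the XML content here.
        public static class XmlLineMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // ... process each record ...
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "combined-small-files");
            job.setJarByClass(CombinedSmallFilesJob.class);

            // Bundle small files into splits of up to 64 MB, so each map
            // task gets many files rather than one file per task.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

            job.setMapperClass(XmlLineMapper.class);
            job.setNumReduceTasks(0);          // map-only analysis
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The split-size cap plays the same role as explicitly bundling files into 64 MB
archives, but the framework does the grouping at job-submission time, so the
individual files on HDFS can still be replaced nightly.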




Andrzej Jan Taramina wrote:
> I'm new to Hadoop, so pardon the potentially dumb question....
>
> I've gathered, from much research, that Hadoop is not always a good choice when you need
> to process a whack of smaller files, which is what we need to do.
>
> More specifically, we need to start by processing about 250K XML files, each of which is
> in the 50K - 2M range, with an average size of 100K bytes. The processing we need to do on
> each file is pretty CPU-intensive, with a lot of pattern matching. What we need to do would
> fall nicely into the Map/Reduce paradigm. Over time, the volume of files will grow by an
> order of magnitude into the range of millions of files, hence the desire to use a mapred
> distributed cluster to do the analysis we need.
>
> Normally, one could just concatenate the XML files into bigger input files. Unfortunately,
> one of our constraints is that a certain percentage of these XML files will change every
> night, and so we need to be able to update the Hadoop data store (HDFS perhaps) on a
> regular basis. This would be difficult if the files are all concatenated.
>
> The XML data originally comes from a number of XML databases.
>
> Any advice/suggestions on the best way to structure our data storage of all the XML files
> so that Hadoop would run efficiently and we could thus use Map/Reduce on a Hadoop cluster,
> yet still conveniently update the changed files on a nightly basis?
>
> Much appreciated!