hadoop-common-user mailing list archives

From Piotr Praczyk <piotr.prac...@gmail.com>
Subject Re: Processing a large quantity of smaller XML files?
Date Mon, 14 Sep 2009 13:44:32 GMT
Hi

Maybe you should consider using HBase instead of plain HDFS?
HDFS uses a large block size (64 MB by default), and every file takes up at
least one block entry in the NameNode and gets its own map task, so a large
number of small files carries a lot of overhead. HBase runs on top of HDFS
and would store many files in the same block while still allowing you to
modify them selectively.
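
To put rough numbers on this, here is a small back-of-the-envelope sketch
using the figures from the original question. The 64 MB block size and the
reading of "100K bytes" as 100 * 1024 are assumptions, and the variable and
function names are made up for illustration only.

```python
import math

# Figures taken from the original question.
NUM_FILES = 250_000
AVG_FILE_BYTES = 100 * 1024          # assumption: "100K bytes" = 100 KiB

# Assumption: a 64 MB HDFS block size; adjust to match your cluster.
BLOCK_BYTES = 64 * 1024 * 1024


def packed_block_count(num_files, avg_file_bytes, block_bytes):
    """Blocks needed if all files were packed back to back."""
    total = num_files * avg_file_bytes
    return math.ceil(total / block_bytes)


total_bytes = NUM_FILES * AVG_FILE_BYTES

# Every file here is smaller than one block, so storing the files
# individually means one block (and typically one map task) per file.
blocks_one_per_file = NUM_FILES

blocks_packed = packed_block_count(NUM_FILES, AVG_FILE_BYTES, BLOCK_BYTES)
files_per_block = BLOCK_BYTES // AVG_FILE_BYTES

print("total data (bytes):       ", total_bytes)
print("blocks, one file each:    ", blocks_one_per_file)
print("blocks, packed together:  ", blocks_packed)
print("average files per block:  ", files_per_block)
```

Under these assumptions the data packs into a few hundred blocks instead of
250,000, which is the kind of grouping a store like HBase does for you.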

regards
Piotr

2009/9/14 Andrzej Jan Taramina <andrzej@chaeron.com>

> I'm new to Hadoop, so pardon the potentially dumb question....
>
> I've gathered, from much research, that Hadoop is not always a good choice
> when you need to process a whack of smaller files, which is what we need
> to do.
>
> More specifically, we need to start by processing about 250K XML files,
> each of which is in the 50K - 2M range, with an average size of 100K
> bytes. The processing we need to do on each file is pretty CPU-intensive,
> with a lot of pattern matching. What we need to do would fall nicely into
> the Map/Reduce paradigm. Over time, the volume of files will grow by an
> order of magnitude into the range of millions of files, hence the desire
> to use a mapred distributed cluster to do the analysis we need.
>
> Normally, one could just concatenate the XML files into bigger input
> files. Unfortunately, one of our constraints is that a certain percentage
> of these XML files will change every night, and so we need to be able to
> update the Hadoop data store (HDFS perhaps) on a regular basis. This would
> be difficult if the files are all concatenated.
>
> The XML data originally comes from a number of XML databases.
>
> Any advice/suggestions on the best way to structure our data storage of
> all the XML files so that Hadoop would run efficiently and we could thus
> use Map/Reduce on a Hadoop cluster, yet still conveniently update the
> changed files on a nightly basis?
>
> Much appreciated!
>
> --
> Andrzej Taramina
> Chaeron Corporation: Enterprise System Solutions
> http://www.chaeron.com
>
