hadoop-common-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Re: Processing a large quantity of smaller XML files?
Date Fri, 18 Sep 2009 03:37:24 GMT
If you have a unique id for each file (as you must if you are updating
them), then I think you would be surprised at how fast you can merge your
old archive file with the updated and added versions.

This is a special case of the TB update puzzle.  Take as given a
conventional disk with 100MB/s transfer and 10ms seek and rotation times,
and suppose that you want to update 1% of the 100-byte records in a 1TB
database.  You will find that simply copying the entire database and
inserting the updated records into the copy at the right places is about
100x faster than naive random access.
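
The arithmetic behind that figure can be checked with a short sketch.  The
disk and database numbers are the ones given above; the model of two seeks
per random update (one to read the record, one to write it back) is an
assumption made here for illustration, as are all the names in the code.

```python
# Back-of-envelope arithmetic for the TB update puzzle.
# Disk and database figures come from the message; the two-seeks-per-update
# model is an assumption of this sketch.

DB_BYTES = 10**12             # 1 TB database
RECORD_BYTES = 100            # 100-byte records
TRANSFER_BPS = 100 * 10**6    # 100 MB/s sequential transfer
SEEK_S = 0.010                # 10 ms seek plus rotation
UPDATE_PERCENT = 1            # update 1% of the records


def random_update_seconds():
    """Time to update records in place, one random access at a time."""
    records = DB_BYTES // RECORD_BYTES
    updates = records * UPDATE_PERCENT // 100
    # Assumed: one seek to read the record, one more to write it back.
    return updates * 2 * SEEK_S


def copy_update_seconds():
    """Time to stream the whole database in and write the whole copy out."""
    return 2 * DB_BYTES / TRANSFER_BPS


if __name__ == "__main__":
    r = random_update_seconds()
    c = copy_update_seconds()
    print(f"random access: {r:,.0f} s")
    print(f"full copy:     {c:,.0f} s")
    print(f"speedup:       {r / c:.0f}x")
```

Under these assumptions the random-access route takes roughly 2 million
seconds (about 23 days) and the full copy roughly 20 thousand seconds
(under 6 hours), which is where the factor of about 100 comes from.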

Moreover, in your case, you will be able to parallelize the update step,
which should make it even faster.  If your updates are available in sorted
order, then you can do this with a map-side merge and no reduce.
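
The core of such a merge is a single pass over two inputs sorted by id.
The following is a minimal plain-Python sketch of that pass, not Hadoop
API code.  It assumes records are (id, payload) pairs, that ids are unique
within each input, and that both inputs are sorted by id; the function
name merge_updates is hypothetical.

```python
def merge_updates(old_records, updates):
    """Merge two iterables of (id, payload), both sorted by id.

    When the same id appears in both inputs, the update replaces the
    old record.  Ids only in `updates` are treated as newly added files.
    """
    old_iter, upd_iter = iter(old_records), iter(updates)
    old, upd = next(old_iter, None), next(upd_iter, None)
    while old is not None or upd is not None:
        if upd is None or (old is not None and old[0] < upd[0]):
            # Old record has no pending update: copy it through.
            yield old
            old = next(old_iter, None)
        elif old is None or upd[0] < old[0]:
            # Update has no matching old record: it is a new file.
            yield upd
            upd = next(upd_iter, None)
        else:
            # Same id on both sides: the update wins.
            yield upd
            old = next(old_iter, None)
            upd = next(upd_iter, None)


if __name__ == "__main__":
    archive = [(1, "a"), (2, "b"), (4, "d")]
    nightly = [(2, "B"), (3, "c")]
    for record in merge_updates(archive, nightly):
        print(record)
```

In a cluster setting each map task would run this kind of pass over its
own slice, provided the archive and the updates are partitioned and sorted
the same way.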

You should be able to sustain about 200-500MB/s read speed for every 10
spindles that your Hadoop cluster runs on.  Assuming that your cluster is
20-50 nodes, you should be able to update your database in about 20
minutes or less per TB of data.  Is that really too slow?
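
As a rough sanity check on that estimate, the read side can be worked out
as below.  The number of spindles per node is not stated anywhere in this
thread, so spindles_per_node is a purely hypothetical parameter, and the
sketch covers only the time to read 1TB, not the time to write the merged
copy.

```python
def read_minutes_per_tb(nodes, spindles_per_node, mb_per_s_per_10_spindles):
    """Minutes to read 1 TB at the aggregate rate implied by the inputs.

    spindles_per_node is an assumption of this sketch; the message gives
    only a per-10-spindle rate and a node count.
    """
    spindles = nodes * spindles_per_node
    read_mb_per_s = spindles / 10 * mb_per_s_per_10_spindles
    read_seconds = 10**6 / read_mb_per_s   # 1 TB = 10**6 MB
    return read_seconds / 60


if __name__ == "__main__":
    # Pessimistic and optimistic ends of the stated ranges,
    # with an assumed 4 spindles per node.
    print(read_minutes_per_tb(20, 4, 200))
    print(read_minutes_per_tb(50, 4, 500))
```

With an assumed 4 spindles per node, the slow end of the range (20 nodes
at 200MB/s per 10 spindles) reads a TB in a little over 10 minutes, which
leaves room for the write side within the 20 minute figure.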

The cheap alternative is to just keep each night's updates in a separate
archive and keep a separate file with pointers to which archive the latest
version is in.  Your map program would consult the separate file as each
file is read to determine whether it is looking at the most recent version,
and use the file only if it is.  Occasional merges would be in order to
make sure most of the files you scan are relatively large.  This is
essentially what HBase would be doing, but stripped to the essentials.
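
A minimal sketch of that pointer-file scheme, in plain Python rather than
Hadoop API code, might look like this.  It assumes the pointer file can be
held as an in-memory mapping from file id to archive name; the function
and archive names are hypothetical.

```python
def build_latest_index(archives):
    """Build the pointer table: file id -> name of the newest archive.

    `archives` is a list of (archive_name, iterable_of_file_ids),
    ordered oldest first, so later archives overwrite earlier entries.
    """
    latest = {}
    for name, ids in archives:
        for file_id in ids:
            latest[file_id] = name
    return latest


def map_records(archive_name, records, latest):
    """Yield only records whose newest version lives in this archive."""
    for file_id, payload in records:
        if latest.get(file_id) == archive_name:
            yield file_id, payload


if __name__ == "__main__":
    latest = build_latest_index([
        ("base", ["f1", "f2", "f3"]),
        ("night1", ["f2", "f4"]),
    ])
    base_records = [("f1", "v0"), ("f2", "v0"), ("f3", "v0")]
    # f2 is skipped here because night1 holds its newest version.
    print(list(map_records("base", base_records, latest)))
```

The periodic merge then amounts to rewriting the surviving records into
one large archive and resetting the pointer table to point at it.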

On Thu, Sep 17, 2009 at 5:59 PM, Andrzej Jan Taramina wrote:

> > The simplest thing you could do is to use the Hadoop ARchive format
> > (HAR) in a pre-processing step.  The best thing you could do is to have
> > a pre-processing step based on sequence file (note: either Oozie or
> > Cascading are great workflow systems to help you out).
> That doesn't work, since some of our files are updated every night.
> > When you say "update" nightly, do you mean "add new files" or "update
> > existing files"?  If you really mean changing existing files, HBase
> > might be good for you -
> We have to change existing files...and add some new ones as well.  So HAR
> won't really cut it for us.

Ted Dunning, CTO
