hbase-user mailing list archives

From stack <st...@duboce.net>
Subject Re: Processing a large quantity of smaller XML files?
Date Wed, 23 Sep 2009 22:30:14 GMT
On Wed, Sep 23, 2009 at 2:01 PM, Andrzej Jan Taramina wrote:

> I asked about the best way to process a large quantity of smaller XML files
> using Hadoop mapred, on the main Hadoop
> mailing list, and was advised that HBase would be a good alternative to
> handle this.
> ...

> What I would like to do is to have Hadoop data nodes also running HBase
> regionservers on the same machine if this is
> feasible, so that when a Map/Reduce job runs, the data it needs to access
> would ideally be local to the machine (eg.
> local HBase region), at least in theory.

> Is this doable/advisable?  Anyone done this before...that is, having a
> Hadoop/HDFS data node running on the same machine
> as a HBase regionserver, where a mapred job running on the Hadoop node will
> access local data on that machine?

This is probably the most common deployment pattern (and yes, MR will assign
the task to the tasktracker running beside the regionserver hosting the
region it is supposed to work on).  The common mistake people make is doing
this on underpowered machines.  Check our mail archives for multiple
instances of folks overwhelming their hbase/hdfs with mapreduce task
children run amok, with i/o and cpu starving hdfs and hbase (see the mail
with the subject "HBase, Bigtable, and storage engineering..."; it starts
out with a characterization of this phenomenon).  MapReduce is
batch-oriented and generally idempotent, with ten-minute timeouts, and on
failure tasks are retried; i.e., it's sloppy.  HBase is at the other
extreme.  So just be conscious that a balance has to be struck between the
contending processes.  We can help with this if you have trouble figuring
it out yourself.
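For example, one common way to strike that balance (an illustrative sketch
only; the right values depend entirely on your hardware) is to cap how many
task children each tasktracker may run and how much heap each child gets,
leaving cpu and i/o headroom for the co-located regionserver and datanode.
In mapred-site.xml on each slave, something like:

```xml
<!-- Illustrative values only: leave headroom for the regionserver
     and datanode running on the same box. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
<!-- Cap per-child heap so task children cannot crowd out hbase/hdfs. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
```

Start conservative and raise the limits only while the regionserver stays
responsive under load.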

