hadoop-common-user mailing list archives

From N Keywal <nkey...@gmail.com>
Subject Re: HBase and MapReduce data locality
Date Wed, 29 Aug 2012 05:47:54 GMT

Locations are per block: a file is a set of blocks, and each block is
replicated on multiple HDFS datanodes.
We have locality in HBase because the HDFS datanodes are deployed on the same
boxes as the HBase regionservers, and HDFS writes one replica of each block to
the datanode on the same machine as the client (i.e. the regionserver, from
HDFS's point of view). Since the files of every column family in a region are
written by the same regionserver, each of them gets a local replica there.
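
To make the placement rule concrete, here is a toy sketch in Python. It is not HDFS code: the function names, the host names, and the random choice of the remaining replicas are all assumptions made for illustration (real HDFS picks the other replicas with rack awareness). The only rule it models is the one described above, that the first replica goes to the datanode on the writer's own machine.

```python
import random


def place_replicas(writer_host, datanodes, replication=3, rng=None):
    """Toy placement: first replica on the writer's own datanode (if it runs
    one), remaining replicas on other datanodes chosen at random.
    Hypothetical helper, not an HDFS API."""
    rng = rng or random.Random(0)
    replicas = []
    if writer_host in datanodes:
        replicas.append(writer_host)
    others = [d for d in datanodes if d not in replicas]
    needed = min(replication - len(replicas), len(others))
    replicas.extend(rng.sample(others, needed))
    return replicas


def locality(reader_host, blocks):
    """Fraction of blocks that have a replica on reader_host."""
    if not blocks:
        return 0.0
    return sum(reader_host in b for b in blocks) / len(blocks)


datanodes = ["node1", "node2", "node3", "node4", "node5"]

# The regionserver on node2 writes the store files of two column families
# of one region; each file is modelled as a list of blocks, each block as
# the list of hosts holding a replica.
cf_a = [place_replicas("node2", datanodes) for _ in range(4)]
cf_b = [place_replicas("node2", datanodes) for _ in range(4)]

# A task running on node2 finds a local replica of every block of both files.
print(locality("node2", cf_a + cf_b))  # 1.0
```

In this model a task scheduled on the regionserver's machine reads both column families locally, even though they live in separate files. The sketch says nothing about what happens after a region moves to another regionserver.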


On Wed, Aug 29, 2012 at 6:20 AM, Robert Dyer <psybers@gmail.com> wrote:

> I have been reading up on HBase and my understanding is that the
> physical files on the HDFS are split first by region and then by
> column families.
> Thus each column family has its own physical file (on a per-region basis).
> If I run a MapReduce task that uses the HBase as input, wouldn't this
> imply that if the task reads from more than 1 column family the data
> for that row might not be (entirely) local to the task?
> Is there a way to tell the HDFS to keep blocks of each region's column
> families together?
