hbase-user mailing list archives

From Lars George <lars.geo...@gmail.com>
Subject Re: Data locality in HBase
Date Fri, 15 Jun 2012 08:21:46 GMT
Hi Ben,

See inline...

On Jun 15, 2012, at 6:56 AM, Ben Kim wrote:

> Hi,
> I've been posting questions in the mailing-list quite often lately, and
> here goes another one about data locality.
> I read the excellent blog post about data locality that Lars George wrote
> at http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
> I understand data locality in hbase as locating a region in a region-server
> where most of its data blocks reside.

The opposite is happening, i.e. the region server process causes all data it writes to be
located on the same physical machine it runs on.
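
As a rough illustration of that point, here is a toy model in Python (not HBase or HDFS code). It assumes the first replica of every block lands on the data node co-located with the writer and the remaining replicas land on other nodes; rack awareness is ignored. The function and node names are made up for the sketch.

```python
import random

def place_block(writer_node, nodes, replication=3, rng=random):
    """Toy placement: first replica on the writer's node, the rest elsewhere."""
    others = [n for n in nodes if n != writer_node]
    return [writer_node] + rng.sample(others, replication - 1)

def locality(blocks, node):
    """Fraction of blocks that have at least one replica on `node`."""
    if not blocks:
        return 0.0
    return sum(node in replicas for replicas in blocks) / len(blocks)

nodes = ["node1", "node2", "node3", "node4", "node5"]
# A region server on node1 writes 100 blocks worth of store files.
blocks = [place_block("node1", nodes) for _ in range(100)]
print(locality(blocks, "node1"))  # 1.0: every block has a local replica
```

In this model the locality comes from where the writer runs, not from any decision about where to open the region.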

> So that way fast data access is guaranteed when running an MR job because each
> map/reduce task is run for each region in the tasktracker where the region
> co-locates.


> But what if the data blocks of the region are evenly spread over multiple
> region-servers?

This will not happen, unless the original server fails. Then the region is moved to another
server that now needs to do a lot of remote reads over the network. This is why there is work
being done to allow for custom placement policies in HDFS. That way you can store the entire
region and all copies as complete units on three data nodes. In case of a failure you can then
move the region to one of the two remaining copies. This is not available yet though, but it
is being worked on (so I heard).
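
To make the failure case concrete, here is a small Python simulation (a toy model, not HBase or HDFS code). `default_placement` assumes the first replica is local to the writer and the others are scattered over the remaining nodes. `grouped_placement` is a hypothetical stand-in for the kind of custom placement policy described above. All names are made up for the sketch.

```python
import random

def default_placement(writer, nodes, n_blocks, replication=3, rng=random):
    """First replica local, the others on randomly chosen other nodes."""
    others = [n for n in nodes if n != writer]
    return [[writer] + rng.sample(others, replication - 1) for _ in range(n_blocks)]

def grouped_placement(group, n_blocks):
    """Hypothetical policy: every block of the region goes to the same nodes."""
    return [list(group) for _ in range(n_blocks)]

def locality(blocks, node):
    """Fraction of blocks with a replica on `node`."""
    return sum(node in replicas for replicas in blocks) / len(blocks)

nodes = ["node1", "node2", "node3", "node4", "node5"]
rng = random.Random(0)

spread = default_placement("node1", nodes, 1000, rng=rng)
grouped = grouped_placement(["node1", "node2", "node3"], 1000)

# node1 fails and the region is reopened on node2.
print(locality(spread, "node2"))   # well below 1.0: many reads go remote
print(locality(grouped, "node2"))  # 1.0: node2 holds a full copy
```

With scattered replicas the new server only holds whichever blocks happened to land on it, whereas with grouped placement either of the two remaining nodes can take over the region without remote reads.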

> Does an MR task have to remotely access the data blocks from other
> regionservers?

For the above failure case, it would be the region server accessing the remote data, yes.

> How good is hbase at locating datablocks where a region resides?

That is again the wrong way around. HBase has no clue as to where blocks reside, nor does
it know that the file system in fact uses separate blocks. HBase stores files; HDFS does the
block magic under the hood, transparently to HBase.

> Also is it correct to say that if I set a smaller data block size, data
> locality gets worse, and if the data block size gets bigger, data locality gets
> better?

This is not applicable here. I am assuming this stems from the above confusion about which
system is handling the blocks, HBase or HDFS. See above.


> Best regards,
> -- 
> *Benjamin Kim*
> *benkimkimben at gmail*
