hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Data locality in HBase
Date Thu, 21 Jun 2012 05:19:24 GMT
Minor addition to what Lars G said.
In trunk, the load balancer can use block location information when it
chooses the region server that receives a region.
See the following in RegionLocationFinder:

   * Returns an ordered list of hosts that are hosting the blocks for this region. The weight of
...
  protected List<ServerName> internalGetTopBlockLocations(HRegionInfo region) {
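
To make the idea concrete, here is a minimal sketch of ordering hosts by how much of a region's data they hold. It is not the RegionLocationFinder code: the class BlockLocalitySketch, the Block type, and the helper names are hypothetical, and it assumes the weight of a host is simply the number of bytes of block replicas stored on it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT HBase's RegionLocationFinder.
final class BlockLocalitySketch {

    /** One HDFS block of a store file: its length and the hosts holding a replica. */
    static final class Block {
        final long lengthBytes;
        final List<String> hosts;

        Block(long lengthBytes, List<String> hosts) {
            this.lengthBytes = lengthBytes;
            this.hosts = hosts;
        }
    }

    /** Assumed weighting: per host, the total bytes of blocks with a replica there. */
    static Map<String, Long> hostWeights(List<Block> blocks) {
        Map<String, Long> weights = new HashMap<>();
        for (Block b : blocks) {
            for (String host : b.hosts) {
                weights.merge(host, b.lengthBytes, Long::sum);
            }
        }
        return weights;
    }

    /** Hosts ordered by descending weight; ties are broken by host name. */
    static List<String> topHosts(List<Block> blocks) {
        Map<String, Long> weights = hostWeights(blocks);
        List<String> hosts = new ArrayList<>(weights.keySet());
        hosts.sort((a, b) -> {
            int byWeight = Long.compare(weights.get(b), weights.get(a));
            return byWeight != 0 ? byWeight : a.compareTo(b);
        });
        return hosts;
    }

    /** Fraction of the region's bytes that have a replica on the given host. */
    static double localityFraction(List<Block> blocks, String host) {
        long total = 0;
        long local = 0;
        for (Block b : blocks) {
            total += b.lengthBytes;
            if (b.hosts.contains(host)) {
                local += b.lengthBytes;
            }
        }
        return total == 0 ? 0.0 : (double) local / total;
    }
}
```

A balancer working along these lines would prefer the first host returned by topHosts when it has to move a region, since that host can serve most reads from local disk.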


On Fri, Jun 15, 2012 at 1:21 AM, Lars George <lars.george@gmail.com> wrote:

> Hi Ben,
>
> See inline...
>
> On Jun 15, 2012, at 6:56 AM, Ben Kim wrote:
>
> > Hi,
> >
> > I've been posting questions in the mailing-list quite often lately, and
> > here goes another one about data locality.
> > I read the excellent blog post about data locality that Lars George wrote
> > at http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
> >
> > I understand data locality in hbase as locating a region in a region-server
> > where most of its data blocks reside.
>
> The opposite is happening, i.e. the region server process causes all the
> data it writes to be placed on the same physical machine.
>
> > So that way fast data access is guaranteed when running an MR job because
> > each map/reduce task is run for each region in the tasktracker where the
> > region co-locates.
>
> Correct.
>
> > But what if the data blocks of the region are evenly spread over multiple
> > region-servers?
>
> This will not happen, unless the original server fails. Then the region is
> moved to another server, which now needs to do a lot of remote reads over the
> network. This is why there is work being done to allow for custom placement
> policies in HDFS. That way you can store the entire region and all copies
> as complete units on three data nodes. In case of a failure you can then
> move the region to one of the two copies. This is not available yet though,
> but it is being worked on (so I heard).
>
> > Does an MR task have to remotely access the data blocks from other
> > regionservers?
>
> For the above failure case, it would be the region server accessing the
> remote data, yes.
>
> > How good is hbase at locating data blocks where a region resides?
>
> That is again the wrong way around. HBase has no clue as to where blocks
> reside, nor does it know that the file system in fact uses separate blocks.
> HBase stores files; HDFS does the block magic under the hood,
> transparently to HBase.
>
> > Also, is it correct to say that if I set a smaller data block size, data
> > locality gets worse, and if the data block size gets bigger, data locality
> > gets better?
>
> This is not applicable here; I assume it stems from the above
> confusion about which system is handling the blocks, HBase or HDFS. See
> above.
>
> HTH,
> Lars
>
> >
> > Best regards,
> > --
> >
> > *Benjamin Kim*
> > *benkimkimben at gmail*
>
>
