hbase-dev mailing list archives

From "Jim Kellerman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-57) [hbase] Master should allocate regions to regionservers based upon data locality and rack awareness
Date Fri, 27 Mar 2009 20:39:50 GMT

    [ https://issues.apache.org/jira/browse/HBASE-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690089#action_12690089 ]

Jim Kellerman commented on HBASE-57:

> Samuel Guo added a comment - 26/Mar/09 06:48 AM
> Hi hbasers,
> I'd like to work on this issue as my GSOC project "Exploit locality when
> assigning regions in HBase".
> After talking with Stack over email, I have some initial thoughts on this
> issue. I'd like to share them with you and welcome your comments.
> Before designing a suitable mechanism for using a region's locality, we need
> to know how blocks are allocated in an HBase cluster and how the data blocks
> of a given region are distributed over its lifetime in HBase, so that we can
> find out how region locality affects performance. It is difficult to capture
> all this information in a real cluster. An alternative way to study the
> locality phenomenon may be to simulate the data-block placement procedure in
> HDFS (local node, local rack, and remote rack) and the region-allocation
> mechanism of an HBase cluster on a single machine. An approximate detailed
> report from the simulation can be used for analysis and

Although the JobTracker in Hadoop attempts to assign tasks to the machines
that host the data, HDFS clients currently still must go through the datanode
to access the blocks. I believe there is a Jira open for Hadoop to read blocks
directly from local disk (removing communication with the datanode) when the
blocks are on the local machine.

I think that direct disk access for local blocks would be the biggest payoff.

Without direct disk access, it is unclear whether locality offers any
advantage other than reducing network traffic. Benchmarking is needed to
determine whether accessing a block (via the datanode) on the local machine is
much faster than accessing a block from another server. Locality may not
provide much advantage unless the block would otherwise be served from a
server on the other side of a switch or router, and even then the gain depends
on the speed of the connection.

Solid performance data evaluating the cost of:
1) network access to a block in a different rack
2) network access to a block in the same rack but on a different server
3) network access to a block on the same server
4) direct disk access to a block on the same server

would be highly useful. If there is little difference among 1, 2, and 3 (all
of which access a block through a datanode), then locality may not be useful.
On the other hand, if there is a significant difference among 1, 2, and 3,
then we should try to exploit locality if we can.
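
As a rough illustration of the comparison described above, here is a minimal
Python sketch. Everything in it is an assumption made for illustration: the
names time_reads and locality_matters, the path labels, and the 1.5x threshold
are hypothetical and not part of HBase or HDFS. The caller would supply a
function that actually reads a block over each path.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_reads(read_block: Callable[[], bytes], repeats: int = 20) -> float:
    """Return the median wall-clock seconds taken by read_block().

    read_block is a caller-supplied (hypothetical) function that fetches
    one block over the access path being measured.
    """
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_block()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def locality_matters(median_seconds: Dict[str, float],
                     min_ratio: float = 1.5) -> bool:
    """Compare the three datanode-mediated paths (cases 1, 2 and 3).

    Returns True when the slowest of them costs at least min_ratio times
    the fastest. The threshold is an arbitrary illustrative choice.
    """
    datanode_paths = ["remote_rack", "same_rack", "same_server"]
    costs = [median_seconds[path] for path in datanode_paths]
    return max(costs) >= min_ratio * min(costs)
```

Case 4 (direct disk access) could be timed the same way and compared against
"same_server" to estimate what bypassing the datanode would buy.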

I would expect direct disk access to be much faster than access through a
datanode, but there is no hard data available on that. Such data would show
whether direct disk access (bypassing the datanode) is useful. If it is, that
argues in favor of implementing direct disk access as has been proposed, and
once direct disk access is available, that in turn argues in favor of
locality-based region assignment.

As you point out, blocks migrate over time (especially if you are using the HDFS balancer),
and that would 
greatly complicate assignment of regions to region servers. 

Suppose there were one 'hot' datanode that hosted blocks from many regions.
Using locality might end up overloading the region server on that node,
resulting in poorer performance.
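
To make the 'hot' node concern concrete, here is a minimal Python sketch of a
locality-preferring assignment with a per-server cap. It is not how the HBase
master assigns regions; the function assign_regions, its inputs, and the cap
are all hypothetical and exist only to illustrate the trade-off.

```python
from collections import Counter
from typing import Dict, List


def assign_regions(region_blocks: Dict[str, List[str]],
                   servers: List[str],
                   max_regions_per_server: int) -> Dict[str, str]:
    """Assign each region to the server holding most of its blocks,
    unless that server has already reached the cap.

    region_blocks maps a region name to the hosts storing its block
    replicas, one list entry per replica (hypothetical input format).
    """
    load = Counter({server: 0 for server in servers})
    assignment: Dict[str, str] = {}
    for region in sorted(region_blocks):
        local = Counter(h for h in region_blocks[region] if h in load)
        # Most local blocks first, then least loaded, then by name so the
        # result is deterministic.
        ranked = sorted(servers, key=lambda s: (-local[s], load[s], s))
        for server in ranked:
            if load[server] < max_regions_per_server:
                assignment[region] = server
                load[server] += 1
                break
        else:
            raise ValueError("not enough capacity for region " + region)
    return assignment
```

With a generous cap, every region whose blocks sit mostly on one node lands on
that node, which is the overload scenario above. A tight cap spreads the
regions out but gives up locality for the ones that are displaced.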

There is a lot of performance evaluation that needs to be done before we actually take the
step of using
locality-based region assignment. If doing that performance evaluation sounds interesting
to you, I think
that would be a great GSOC project.

Before we try locality-based assignment, we need to have this analysis to see if the idea
is worth pursuing.

> [hbase] Master should allocate regions to regionservers based upon data locality and rack awareness
> ---------------------------------------------------------------------------------------------------
>                 Key: HBASE-57
>                 URL: https://issues.apache.org/jira/browse/HBASE-57
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.2.0
>            Reporter: stack
>             Fix For: 0.20.0
> Currently, regions are assigned to regionservers based on a basic loading
> attribute. A factor to include in the assignment calculation is the location
> of the region in HDFS, i.e. the servers hosting region replicas. If the
> cluster is such that regionservers run on the same nodes as those running
> HDFS, then ideally the regionserver for a particular region should run on a
> server that hosts a replica of that region.

