hbase-issues mailing list archives

From "Stephen Yuan Jiang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18036) Data locality is not maintained after cluster restart or SSH
Date Thu, 11 May 2017 23:57:04 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007393#comment-16007393 ]

Stephen Yuan Jiang commented on HBASE-18036:
--------------------------------------------

The attached V0 patch is my first attempt at resolving this issue. The change is in SSH. If
the dead region server has already restarted by the time SSH runs (it will have the same
hostname and port, but a different start code in ServerName), SSH tries to retain locality
by assigning the regions back to that region server. I introduced a config option for anyone
who wants to keep the round-robin assignment behavior.
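
A minimal sketch of the matching rule described above, not the code in the patch. `ServerId`, `findRestartedInstance`, and the `retainAssignment` flag are hypothetical names made up for illustration; the patch itself works with HBase's ServerName and its own config key.

```java
import java.util.Arrays;
import java.util.List;

public class RetainOnRestartSketch {

    /** Hypothetical stand-in for HBase's ServerName: hostname, port, start code. */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }

        /** True when both ids name the same host:port, regardless of start code. */
        boolean sameHostAndPort(ServerId other) {
            return host.equals(other.host) && port == other.port;
        }

        @Override
        public String toString() {
            return host + "," + port + "," + startCode;
        }
    }

    /**
     * Returns the online server that is a restarted instance of the dead one
     * (same host and port, different start code), or null if there is none or
     * if the (hypothetical) retainAssignment switch is off.
     */
    static ServerId findRestartedInstance(ServerId dead, List<ServerId> online,
                                          boolean retainAssignment) {
        if (!retainAssignment) {
            return null; // caller falls back to round-robin assignment
        }
        for (ServerId candidate : online) {
            if (candidate.sameHostAndPort(dead) && candidate.startCode != dead.startCode) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ServerId dead = new ServerId("rs1.example.com", 16020, 1000L);
        List<ServerId> online = Arrays.asList(
            new ServerId("rs2.example.com", 16020, 1500L),
            new ServerId("rs1.example.com", 16020, 2000L));
        System.out.println(findRestartedInstance(dead, online, true));
        System.out.println(findRestartedInstance(dead, online, false));
    }
}
```

When no restarted instance is found, the caller would fall back to whatever assignment it used before, which is why the sketch returns null rather than picking a server itself.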

I forced the existing TestAssignmentManagerOnCluster tests to use the new code path in SSH
and did not see any problems. What is still missing is a new UT in TestAssignmentManagerOnCluster
that tests the retain-assignment code path in SSH.

For now, I'd like to post this V0 patch to get some feedback.

> Data locality is not maintained after cluster restart or SSH
> ------------------------------------------------------------
>
>                 Key: HBASE-18036
>                 URL: https://issues.apache.org/jira/browse/HBASE-18036
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>         Attachments: HBASE-18036.v0-branch-1.1.patch
>
>
> After HBASE-2896 / HBASE-4402, we think data locality is maintained after cluster restart.
> However, we have seen some complaints about data locality loss on cluster restart (e.g. HBASE-17963).
>
> Examining the AssignmentManager#processDeadServersAndRegionsInTransition() code, for
> cluster start I expected to hit the following code path:
> {code}
>     if (!failover) {
>       // Fresh cluster startup.
>       LOG.info("Clean cluster startup. Assigning user regions");
>       assignAllUserRegions(allRegions);
>     }
> {code}
> where assignAllUserRegions would use the retainAssignment() call in LoadBalancer; however,
> from the master log, we usually hit the failover code path:
> {code}
>     // If we found user regions out on cluster, its a failover.
>     if (failover) {
>       LOG.info("Found regions out on cluster or in RIT; presuming failover");
>       // Process list of dead servers and regions in RIT.
>       // See HBASE-4580 for more information.
>       processDeadServersAndRecoverLostRegions(deadServers);
>     }
> {code}
> where processDeadServersAndRecoverLostRegions() puts the dead servers into SSH, and SSH
> uses roundRobinAssignment() in LoadBalancer. That is why we see locality lost more often
> than retained during cluster restart.
> Note: the code I was looking at is close to branch-1 and branch-1.1.
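
To illustrate why the two code paths differ in locality, here is a simplified sketch that contrasts a round-robin plan with a retained plan. It is not the LoadBalancer implementation: `roundRobin`, `retain`, and `retainedCount` are hypothetical helpers, servers are plain hostname strings, and the real roundRobinAssignment() and retainAssignment() are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AssignmentLocalitySketch {

    /** Simplified round robin: region i goes to server i mod N, ignoring history. */
    static Map<String, String> roundRobin(List<String> regions, List<String> servers) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (int i = 0; i < regions.size(); i++) {
            plan.put(regions.get(i), servers.get(i % servers.size()));
        }
        return plan;
    }

    /** Simplified retain: keep the previous host if it is online, else round robin. */
    static Map<String, String> retain(Map<String, String> previous, List<String> servers) {
        Map<String, String> plan = new LinkedHashMap<>();
        int next = 0;
        for (Map.Entry<String, String> entry : previous.entrySet()) {
            String oldHost = entry.getValue();
            if (servers.contains(oldHost)) {
                plan.put(entry.getKey(), oldHost);
            } else {
                plan.put(entry.getKey(), servers.get(next % servers.size()));
                next++;
            }
        }
        return plan;
    }

    /** Counts regions whose planned host equals the host they were on before. */
    static int retainedCount(Map<String, String> previous, Map<String, String> plan) {
        int count = 0;
        for (Map.Entry<String, String> entry : previous.entrySet()) {
            if (entry.getValue().equals(plan.get(entry.getKey()))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Region -> host before the restart (made-up example data).
        Map<String, String> previous = new LinkedHashMap<>();
        previous.put("regionA", "host1");
        previous.put("regionB", "host1");
        previous.put("regionC", "host2");
        previous.put("regionD", "host3");
        List<String> servers = Arrays.asList("host1", "host2", "host3");

        Map<String, String> rr = roundRobin(new ArrayList<>(previous.keySet()), servers);
        Map<String, String> kept = retain(previous, servers);
        System.out.println("round robin retained: " + retainedCount(previous, rr));
        System.out.println("retain retained: " + retainedCount(previous, kept));
    }
}
```

With this example data the round-robin plan keeps only one of four regions on its previous host, while the retained plan keeps all four, which matches the locality loss described in the issue.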



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
