hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
Date Fri, 19 Jun 2015 23:36:01 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594118#comment-14594118
] 

Andrew Purtell commented on HBASE-13937:
----------------------------------------

bq. Do you mind re-review according to above? 

The check for isDeadServer is moved out of the retry loop:
{code}
@@ -903,13 +901,14 @@ public class ServerManager {
   public boolean isServerReachable(ServerName server) {
     if (server == null) throw new NullPointerException("Passed server is null");
 
+    synchronized (this.onlineServers) {
+      if (this.deadservers.isDeadServer(server)) {
+        return false;
+      }
+    }
+
     RetryCounter retryCounter = pingRetryCounterFactory.create();
     while (retryCounter.shouldRetry()) {
-      synchronized (this.onlineServers) {
-        if (this.deadservers.isDeadServer(server)) {
-          return false;
-        }
-      }
       try {
         AdminService.BlockingInterface admin = getRsAdmin(server);
         if (admin != null) {
{code}

Yes, I was wrong about the second part. The patch under review here does:
{code}
@@ -917,11 +916,6 @@ public class ServerManager {
           return info != null && info.hasServerName()
             && server.getStartcode() == info.getServerName().getStartCode();
         }
-      } catch (RegionServerStoppedException | ServerNotRunningYetException e) {
-        if (LOG.isDebugEnabled()) {
-          LOG.debug("Couldn't reach " + server, e);
-        }
-        break;
{code}
I must have gone back to the wrong tab and looked at the patch on the original issue.

lgtm, FWIW

> Partially revert HBASE-13172 
> -----------------------------
>
>                 Key: HBASE-13937
>                 URL: https://issues.apache.org/jira/browse/HBASE-13937
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Region Assignment
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0
>
>         Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch
>
>
> HBASE-13172 is supposed to fix a UT issue, but causes other problems that parent jira
(HBASE-13605) is attempting to fix. 
> However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put
it mildly, major design flaws in AM / RS. 
> Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}}
from {{isServerReachable()}} and return false, which then puts the Server to the {{RegionStates.deadServers}}
list. Once it is in that list, we can still assign and unassign regions to the RS after it
has started (because regular assignment does not check whether the server is in  {{RegionStates.deadServers}}.
However, after the first assign and unassign, we cannot assign the region again since then
the check for the lastServer will think that the server is dead. 
> It turns out that a proper patch for 13605 is very hard without fixing rest of  broken
AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For
1.1.1, I think we should just revert parts of HBASE-13172 for now. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message