hbase-dev mailing list archives

From "Sean Busbey (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (HBASE-20634) Reopen region while server crash can cause the procedure to be stuck
Date Mon, 04 Jun 2018 19:03:00 GMT

     [ https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Busbey reopened HBASE-20634:

reopening because this has broken compilation on (at least) master:

13:47:02,452 [INFO] -------------------------------------------------------------
13:47:02,453 [INFO] -------------------------------------------------------------
13:47:02,453 [ERROR] /some/path/to/hbase/hbase-server/src/main/java/org/apache/hadoop/hbase/master/replication/RefreshPeerProcedure.java:[169,54] 'void' type not allowed here
13:47:02,453 [INFO] 1 error

Please push an addendum or revert asap
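
For readers unfamiliar with this javac message: it is reported when a call to a void method is used where a value is required, for example inside a string concatenation or as a method argument. The sketch below only illustrates that class of error. The class and method names are hypothetical and are not taken from RefreshPeerProcedure.java, whose line 169 is not shown in this message.

```java
class VoidTypeSketch {
  // Hypothetical sink that collects messages; stands in for a logger.
  static final StringBuilder log = new StringBuilder();

  // Returns nothing, so a call to it has no value to use in an expression.
  static void recordFailure(String peerId) {
    log.append("failed:").append(peerId);
  }

  // Value-returning counterpart that is safe to use inside an expression.
  static String describeFailure(String peerId) {
    return "failed:" + peerId;
  }

  public static void main(String[] args) {
    // javac rejects the next line with "'void' type not allowed here":
    // String msg = "result=" + recordFailure("peer1");

    // Using the value-returning method compiles.
    String msg = "result=" + describeFailure("peer1");
    System.out.println(msg);
  }
}
```

An addendum for this kind of error usually either calls the void method as its own statement or switches to a method that returns a value.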

> Reopen region while server crash can cause the procedure to be stuck
> --------------------------------------------------------------------
>                 Key: HBASE-20634
>                 URL: https://issues.apache.org/jira/browse/HBASE-20634
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0, 2.0.1
>         Attachments: HBASE-20634-UT.patch, HBASE-20634.branch-2.0.001.patch, HBASE-20634.branch-2.0.002.patch,
> HBASE-20634.branch-2.0.003.patch, HBASE-20634.branch-2.0.004.patch, HBASE-20634.branch-2.0.005.patch,
> HBASE-20634.branch-2.0.006.patch, HBASE-20634.branch-2.0.006.patch, HBASE-20634.branch-2.0.007.patch,
> HBASE-20634.branch-2.0.008.patch, HBASE-20634.branch-2.0.009.patch
> Found this when implementing HBASE-20424, where we transition the peer sync replication
> state while there is a server crash.
> The problem is that in ServerCrashAssign we do not hold the region lock. So after we call
> handleRIT to clear the existing assign/unassign procedures related to this rs, and before we
> schedule the assign procedures, we may schedule an unassign procedure for a region on the
> crashed rs. This procedure will not receive the ServerCrashException. Instead, in
> addToRemoteDispatcher, it will find that it cannot dispatch the remote call, and a
> FailedRemoteDispatchException will be raised. But we do not treat this exception the same as
> ServerCrashException; instead, we try to expire the rs. The rs has already been marked as
> expired, so this is almost a no-op, and the procedure is then stuck forever.
> A possible fix is to treat FailedRemoteDispatchException the same as ServerCrashException,
> since it is created only in addToRemoteDispatcher, and the only reason we cannot dispatch a
> remote call is that the rs is already dead. The nodeMap is a ConcurrentMap, so I think we
> could use it as a guard.
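
The proposed fix can be sketched roughly as follows. This is a minimal illustration, not the actual patch: the exception classes, the Action enum, and the method bodies are hypothetical stand-ins that only borrow the names used in the description, and the map-based liveness check is an assumption about how nodeMap could serve as the guard.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class DispatchFailureSketch {
  // Hypothetical stand-ins for the HBase exception types named in the description.
  static class ServerCrashException extends Exception {}
  static class FailedRemoteDispatchException extends Exception {}

  // Hypothetical outcomes for a failed remote call.
  enum Action { HANDLE_AS_SERVER_CRASH, EXPIRE_SERVER }

  // Stand-in for the dispatcher's nodeMap: servers that can still accept remote calls.
  static final ConcurrentMap<String, Boolean> nodeMap = new ConcurrentHashMap<>();

  // Assumed guard: dispatch fails only when the target server is no longer in nodeMap.
  static void addToRemoteDispatcher(String server) throws FailedRemoteDispatchException {
    if (!nodeMap.containsKey(server)) {
      throw new FailedRemoteDispatchException();
    }
  }

  // The proposed change: both exception types take the server-crash path,
  // so the procedure no longer waits on a server that is already expired.
  static Action onRemoteFailure(Exception e) {
    if (e instanceof ServerCrashException || e instanceof FailedRemoteDispatchException) {
      return Action.HANDLE_AS_SERVER_CRASH;
    }
    return Action.EXPIRE_SERVER;
  }

  public static void main(String[] args) {
    nodeMap.put("rs-live", Boolean.TRUE);
    try {
      addToRemoteDispatcher("rs-crashed"); // not in nodeMap, so dispatch fails
    } catch (FailedRemoteDispatchException e) {
      System.out.println(onRemoteFailure(e));
    }
  }
}
```

In this sketch any other failure still falls through to expiring the server, which matches the behavior described above for exceptions that are not treated as a crash.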

This message was sent by Atlassian JIRA
