hbase-issues mailing list archives

From "Andrew Purtell (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-18111) Replication stuck when cluster connection is closed
Date Thu, 01 Jun 2017 01:49:04 GMT

     [ https://issues.apache.org/jira/browse/HBASE-18111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Andrew Purtell updated HBASE-18111:
-----------------------------------
    Attachment: HBASE-18111-v2.patch

bq. HBaseInterClusterReplicationEndpoint is an HBase client that writes entries to the peer cluster.
Shouldn't we handle the connection-closed case no matter what caused it? As it stands, replication
gets stuck in the while loop.

Sounds reasonable to me. Here's a v2 patch rebased on the latest master for another round of HadoopQA.

> Replication stuck when cluster connection is closed
> ---------------------------------------------------
>
>                 Key: HBASE-18111
>                 URL: https://issues.apache.org/jira/browse/HBASE-18111
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.4.0, 1.3.1, 1.2.5, 0.98.24, 1.1.10
>            Reporter: Guanghao Zhang
>            Assignee: Guanghao Zhang
>         Attachments: HBASE-18111.patch, HBASE-18111-v1.patch, HBASE-18111-v2.patch
>
>
> Log:
> {code}
> 2017-05-24,03:01:25,603 ERROR [regionserver13700-SendThread(hostxxx:11000)] org.apache.zookeeper.ClientCnxn:
SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException:
An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS
initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Connection
reset)]) occurred when evaluating Zookeeper Quorum Member's  received SASL token. Zookeeper
Client will go to AUTH_FAILED state.
> 2017-05-24,03:01:25,615 FATAL [regionserver13700-EventThread] org.apache.hadoop.hbase.client.HConnectionImplementation:
hconnection-0x1148dd9b-0x35b6b4d4ca999c6, quorum=10.108.37.30:11000,10.108.38.30:11000,10.108.39.30:11000,10.108.84.25:11000,10.108.84.32:11000,
baseZNode=/hbase/c3prc-xiaomi98 hconnection-0x1148dd9b-0x35b6b4d4ca999c6 received auth failed
from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = AuthFailed
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:425)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:333)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> 2017-05-24,03:01:25,615 INFO [regionserver13700-EventThread] org.apache.hadoop.hbase.client.HConnectionImplementation:
Closing zookeeper sessionid=0x35b6b4d4ca999c6
> 2017-05-24,03:01:25,623 WARN [regionserver13700.replicationSource,800] org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint:
Replicate edites to peer cluster failed.
> java.io.IOException: Call to hostxxx/10.136.22.6:24600 failed on local exception: java.io.IOException:
Connection closed
> {code}
> jstack
> {code}
>  java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.sleepForRetries(HBaseInterClusterReplicationEndpoint.java:127)
>         at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:199)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:905)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:492)
> {code}
> The cluster connection was aborted when the ZooKeeperWatcher received an AuthFailed event.
Then the HBaseInterClusterReplicationEndpoint's replicate() method gets stuck in a while loop.
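
For illustration only, here is a minimal sketch of the failure mode described above and one way a retry loop could guard against it. This is not the attached patch and not the HBase API: the PeerConnection interface, the replicate() signature, and the class name ReplicateLoopSketch are all hypothetical stand-ins. The idea is simply that a loop which retries on IOException should also check whether the connection has been closed, since a closed connection will never succeed on a later attempt.

```java
import java.io.IOException;

public class ReplicateLoopSketch {

    /** Hypothetical stand-in for the cluster connection; NOT the HBase API. */
    public interface PeerConnection {
        boolean isClosed();
        void ship() throws IOException;
    }

    /**
     * Retries ship() until it succeeds, but gives up once the connection
     * reports itself closed. Without the isClosed() check, a closed
     * connection would make this loop sleep and retry forever.
     *
     * @return true if the edits were shipped, false if the loop gave up
     */
    public static boolean replicate(PeerConnection conn, long sleepMs) {
        while (true) {
            if (conn.isClosed()) {
                // Retrying cannot help; let the caller rebuild the connection.
                return false;
            }
            try {
                conn.ship();
                return true;
            } catch (IOException e) {
                try {
                    // Back off before the next attempt (assumed fixed delay).
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt status and stop retrying.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
    }
}
```

In this sketch the caller decides what to do when replicate() returns false, for example re-creating the connection before trying again. Whether the real fix bails out, reconnects, or aborts the source is a design choice for the patch itself.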



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
