hbase-dev mailing list archives

From "Ashu Pachauri (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-18192) Replication drops recovered queues on region server shutdown
Date Thu, 08 Jun 2017 06:17:18 GMT
Ashu Pachauri created HBASE-18192:

             Summary: Replication drops recovered queues on region server shutdown
                 Key: HBASE-18192
                 URL: https://issues.apache.org/jira/browse/HBASE-18192
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 1.2.6, 1.3.1, 2.0.0, 1.4.0
            Reporter: Ashu Pachauri
            Assignee: Ashu Pachauri
            Priority: Blocker
             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7

When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is
completely dropped on a region server shutdown. This will happen in situations when:
1. DefaultWALProvider is used.
2. RegionGroupingProvider is used but replication is stuck on one WAL group for some
reason (for example HBASE-18137).
3. All other replication workers have died due to unhandled exceptions and only the one
remaining worker finishes; in this case the recovered queue can get deleted even without a
region server shutdown. This can happen on deployments without the fix for HBASE-17381
(see the liveness sketch below).
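As a minimal, hypothetical illustration of point 3 (plain JDK code, not HBase code): a worker
thread that dies from an unhandled exception is no longer alive, so a check based on thread
liveness alone cannot tell it apart from a worker that cleanly finished its work.

public class CrashedWorkerLiveness {
  public static void main(String[] args) throws InterruptedException {
    Thread worker = new Thread(() -> {
      // simulate a replication worker dying from an unhandled exception
      throw new RuntimeException("unhandled failure while shipping edits");
    });
    worker.setUncaughtExceptionHandler((t, e) -> { }); // keep the demo output quiet
    worker.start();
    worker.join();
    // Prints "false": by liveness alone, this crashed worker looks the same as
    // one that finished all of its work.
    System.out.println("alive after crash: " + worker.isAlive());
  }
}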

The problematic piece of code is:
while (isWorkerActive()) {
  // The worker thread run loop...
}
// Once the loop exits (for any reason, including a region server shutdown):
if (replicationQueueInfo.isQueueRecovered()) {
  // use synchronize to make sure one last thread will clean the queue
  synchronized (workerThreads) {
    Threads.sleep(100); // wait a short while for other worker threads to fully exit
    boolean allOtherTaskDone = true;
    for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
      if (!worker.equals(this) && worker.isAlive()) {
        allOtherTaskDone = false;
        break;
      }
    }
    if (allOtherTaskDone) {
      // the recovered queue is removed here
      LOG.info("Finished recovering queue " + peerClusterZnode
          + " with the following stats: " + getStats());
    }
  }
}

The conceptual issue is that isWorkerActive() tells whether a worker is currently running,
yet it is being used as a proxy for whether the worker has finished its work. But, in fact,
"Should a worker exit?" and "Has the worker completed its work?" are two different questions.

This message was sent by Atlassian JIRA
