hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-18192) Replication drops recovered queues on region server shutdown
Date Sat, 10 Jun 2017 08:19:21 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045453#comment-16045453 ]

Hudson commented on HBASE-18192:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #3168 (See [https://builds.apache.org/job/HBase-Trunk_matrix/3168/])
HBASE-18192: Replication drops recovered queues on region server (tedyu: rev eb2dc5d2a524f816fc5cf707b853117bc6ada01a)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSourceShipperThread.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceShipperThread.java
* (edit) hbase-server/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/RecoveredReplicationSource.java


> Replication drops recovered queues on region server shutdown
> ------------------------------------------------------------
>
>                 Key: HBASE-18192
>                 URL: https://issues.apache.org/jira/browse/HBASE-18192
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1, 1.2.6
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
>         Attachments: HBASE-18192.branch-1.001.patch, HBASE-18192.branch-1.3.003.patch, HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationWorkerThread, the recovered queue is completely dropped on a region server shutdown. This can happen when:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used, but replication is stuck on one WAL group for some reason (for example, HBASE-18137).
> 3. All other replication workers have died due to an unhandled exception, and the only remaining one finishes. This causes the recovered queue to be deleted even without a region server shutdown. This can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {Code}
> while (isWorkerActive()){
>         // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>         // use synchronize to make sure one last thread will clean the queue
>         synchronized (workerThreads) {
>           Threads.sleep(100);// wait a short while for other worker thread to fully exit
>           boolean allOtherTaskDone = true;
>           for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>             if (!worker.equals(this) && worker.isAlive()) {
>               allOtherTaskDone = false;
>               break;
>             }
>           }
>           if (allOtherTaskDone) {
>             manager.closeRecoveredQueue(this.source);
>             LOG.info("Finished recovering queue " + peerClusterZnode
>                 + " with the following stats: " + getStats());
>           }
>         }
> }
> {Code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running, and it is being used as a proxy for whether the worker has finished its work. But, in fact, "Should a worker exit?" and "Has it completed its work?" are two different questions.
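
The distinction drawn in the description can be illustrated with a minimal, self-contained sketch. This is not the committed patch and none of these names come from HBase: Worker, requestStop(), markDrained(), noWorkerRunning() and allWorkersDrained() are hypothetical stand-ins. It only assumes that a worker can stop running either because it was told to exit or because it drained its part of the queue, and shows that a liveness check cannot tell the two apart.

```java
import java.util.Arrays;
import java.util.List;

final class RecoveredQueueSketch {

    /** Hypothetical stand-in for a replication worker; not an HBase class. */
    static final class Worker {
        private volatile boolean stopRequested; // "Should the worker exit?"
        private volatile boolean drained;       // "Has it completed its work?"

        void requestStop() { stopRequested = true; } // e.g. region server shutdown
        void markDrained() { drained = true; }       // e.g. last WAL entry shipped

        // Running only while nobody asked it to stop and work remains.
        boolean isRunning() { return !stopRequested && !drained; }
        boolean isDrained() { return drained; }
    }

    /** Liveness-style check, analogous in spirit to the quoted code. */
    static boolean noWorkerRunning(List<Worker> workers) {
        for (Worker w : workers) {
            if (w.isRunning()) {
                return false;
            }
        }
        return true;
    }

    /** Completion-style check: every worker must have finished its work. */
    static boolean allWorkersDrained(List<Worker> workers) {
        for (Worker w : workers) {
            if (!w.isDrained()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Worker worker = new Worker();
        worker.requestStop(); // shutdown arrives before the queue is drained
        List<Worker> workers = Arrays.asList(worker);

        // The liveness check would allow the recovered queue to be closed...
        System.out.println("liveness check says close: " + noWorkerRunning(workers));
        // ...while the completion check would keep it for a later recovery.
        System.out.println("completion check says close: " + allWorkersDrained(workers));
    }
}
```

In this sketch a shutdown leaves the worker stopped but not drained, so only the completion-style check keeps the recovered queue around.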



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
