hbase-issues mailing list archives

From "Ashu Pachauri (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-18192) Replication drops recovered queues on region server shutdown
Date Wed, 21 Jun 2017 20:51:00 GMT

     [ https://issues.apache.org/jira/browse/HBASE-18192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashu Pachauri updated HBASE-18192:
----------------------------------
    Release Note: 
If a region server that is processing a recovered queue for another, previously dead region
server is gracefully shut down, it can drop the recovered queue under certain conditions.
Running without this fix on a 1.2+ release risks ongoing data loss in replication,
irrespective of which WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, this always causes data loss in
replication whenever a region server processing recovered queues is gracefully shut down.

  was:
If region server that is processing recovered queue for another previously dead region server
is gracefully shut down, it can drop the recovered queue under certain conditions. Running
without this fix on a 1.2+ release means possibility of continuing data loss in replication,
irrespective of which WALProvider is used.
If a single WAL group (or DefaultWALProvider) is used, this will always cause dataloss in
replication whenever a region server processing recovered queues is gracefully shutdown.


> Replication drops recovered queues on region server shutdown
> ------------------------------------------------------------
>
>                 Key: HBASE-18192
>                 URL: https://issues.apache.org/jira/browse/HBASE-18192
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.3.1, 1.2.6
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 3.0.0, 1.4.0, 1.3.2, 1.2.7, 2.0.0-alpha-2
>
>         Attachments: HBASE-18192.branch-1.001.patch, HBASE-18192.branch-1.3.003.patch,
>                      HBASE-18192.master.001.patch
>
>
> When a recovered queue has only one active ReplicationWorkerThread, the recovered queue
> is completely dropped on a region server shutdown. This happens in the following situations:
> 1. DefaultWALProvider is used.
> 2. RegionGroupingProvider is used, but replication is stuck on one WAL group for some
> reason (for example, HBASE-18137).
> 3. All other replication workers have died due to an unhandled exception, and the only
> remaining one finishes. This causes the recovered queue to be deleted even without a
> region server shutdown. This can happen on deployments without the fix for HBASE-17381.
> The problematic piece of code is:
> {Code}
> while (isWorkerActive()) {
>   // The worker thread run loop...
> }
> if (replicationQueueInfo.isQueueRecovered()) {
>   // use synchronized to make sure only the last thread cleans up the queue
>   synchronized (workerThreads) {
>     Threads.sleep(100); // wait a short while for other worker threads to fully exit
>     boolean allOtherTaskDone = true;
>     for (ReplicationSourceWorkerThread worker : workerThreads.values()) {
>       if (!worker.equals(this) && worker.isAlive()) {
>         allOtherTaskDone = false;
>         break;
>       }
>     }
>     if (allOtherTaskDone) {
>       manager.closeRecoveredQueue(this.source);
>       LOG.info("Finished recovering queue " + peerClusterZnode
>           + " with the following stats: " + getStats());
>     }
>   }
> }
> {Code}
> The conceptual issue is that isWorkerActive() tells whether a worker is currently running
> or not, and it is being used as a proxy for whether a worker has finished its work. But, in
> fact, "Should a worker exit?" and "Has the worker completed its work?" are two different
> questions.
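
The distinction drawn above can be illustrated with a small, self-contained Java sketch. It is not the HBASE-18192 patch and uses no HBase classes: RecoveredQueueSketch, Worker, ExitReason, and safeToDeleteRecoveredQueue are hypothetical names invented for illustration. The assumption is that each worker records why its run loop ended, so cleanup can check "has every worker drained its queue?" instead of "has every worker stopped running?".

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration only; none of these types exist in HBase. */
public class RecoveredQueueSketch {

    /** Why a worker's run loop ended. */
    enum ExitReason { NOT_EXITED, STOP_REQUESTED, QUEUE_DRAINED }

    static final class Worker {
        private final Deque<String> pendingWals;
        private volatile boolean stopRequested;
        private volatile ExitReason exitReason = ExitReason.NOT_EXITED;

        Worker(List<String> wals) {
            this.pendingWals = new ArrayDeque<>(wals);
        }

        /** Answers "should the worker exit?" (for example, on shutdown). */
        void requestStop() {
            stopRequested = true;
        }

        void run() {
            while (!stopRequested && !pendingWals.isEmpty()) {
                pendingWals.poll(); // stand-in for shipping one WAL to the peer
            }
            // Record the reason separately from the fact that the loop ended.
            exitReason = pendingWals.isEmpty()
                ? ExitReason.QUEUE_DRAINED
                : ExitReason.STOP_REQUESTED;
        }

        /** Answers "has the worker completed its work?". */
        boolean hasFinishedWork() {
            return exitReason == ExitReason.QUEUE_DRAINED;
        }
    }

    /** The recovered queue may be deleted only if every worker drained it. */
    static boolean safeToDeleteRecoveredQueue(List<Worker> workers) {
        for (Worker worker : workers) {
            if (!worker.hasFinishedWork()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Shutdown arrives while WALs are still pending: keep the queue.
        Worker stopped = new Worker(Arrays.asList("wal-1", "wal-2"));
        stopped.requestStop();
        stopped.run();
        System.out.println(safeToDeleteRecoveredQueue(Arrays.asList(stopped))); // false

        // The worker ships everything: the queue can be cleaned up.
        Worker drained = new Worker(Arrays.asList("wal-1", "wal-2"));
        drained.run();
        System.out.println(safeToDeleteRecoveredQueue(Arrays.asList(drained))); // true
    }
}
```

In this sketch a worker that stops because of a shutdown request leaves the recovered queue in place for another region server to claim, while a worker that drains its queue allows cleanup. A dead thread and a finished thread both report isAlive() == false, which is why the liveness check alone cannot tell the two cases apart.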



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
