hbase-issues mailing list archives

From "Allan Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-7006) [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
Date Wed, 23 Mar 2016 02:05:25 GMT

    [ https://issues.apache.org/jira/browse/HBASE-7006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207733#comment-15207733 ]

Allan Yang commented on HBASE-7006:
-----------------------------------

I may have found a bug in this implementation.
{code:java}
@@ -283,6 +319,9 @@ public class SplitLogManager extends ZooKeeperListener {
       }
     }
     waitForSplittingCompletion(batch, status);
+    // remove recovering regions from ZK
+    this.removeRecoveringRegionsFromZK(serverNames);
+
     if (batch.done != batch.installed) {
       batch.isDead = true;
       SplitLogCounters.tot_mgr_log_split_batch_err.incrementAndGet();
@@ -409,6 +448,171 @@ public class SplitLogManager extends ZooKeeperListener {
     return count;
   }
{code}
In your logic, you wait for the split batch task to complete. Then, before checking whether
every task finished without error, you remove the recovering regions from ZK. Only after that
do you check whether the batch completed without error and resubmit the task in LogReplayHandler.
That is a big problem: the region's recovering status is removed from ZK before the split-and-replay
log task is actually done. Although the split task will be resubmitted, it will skip the
regions that are no longer in the recovering state. That means some replays have not finished
before the region can be read again, and that means data loss.
Could you take a look at this problem?
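
To make the ordering concern concrete, here is a minimal, self-contained sketch. It is not the HBase code: `RecoveringRegions`, `Batch`, `clearThenCheck`, and `checkThenClear` are hypothetical stand-ins for the ZK recovering-region markers, the split batch counters, and the two possible orderings. It only shows that clearing the markers before checking `done != installed` leaves a failed batch with no region marked as recovering, while checking first keeps the markers in place for a resubmitted task. Checking first is just one possible ordering, not a reviewed fix.

```java
import java.util.HashSet;
import java.util.Set;

public class RecoveryOrderingSketch {

    /** Hypothetical stand-in for the recovering-region markers kept in ZK. */
    public static final class RecoveringRegions {
        private final Set<String> regions = new HashSet<>();

        public void mark(String region) { regions.add(region); }

        public void clear() { regions.clear(); }

        public boolean isRecovering(String region) { return regions.contains(region); }
    }

    /** Hypothetical stand-in for a split batch: tasks installed vs. tasks done. */
    public static final class Batch {
        public final int installed;
        public final int done;

        public Batch(int installed, int done) {
            this.installed = installed;
            this.done = done;
        }

        public boolean succeeded() { return done == installed; }
    }

    /** Ordering described in the comment: markers are cleared before the batch is checked. */
    public static boolean clearThenCheck(RecoveringRegions zk, Batch batch) {
        zk.clear();               // markers are gone even if the batch failed
        return batch.succeeded();
    }

    /** One possible alternative (an assumption): clear markers only after a successful batch. */
    public static boolean checkThenClear(RecoveringRegions zk, Batch batch) {
        if (!batch.succeeded()) {
            return false;         // markers stay, so a resubmitted task can still replay
        }
        zk.clear();
        return true;
    }

    public static void main(String[] args) {
        Batch failed = new Batch(3, 2); // one task did not finish

        RecoveringRegions first = new RecoveringRegions();
        first.mark("region-1");
        clearThenCheck(first, failed);
        System.out.println("clear-then-check, still recovering: " + first.isRecovering("region-1"));

        RecoveringRegions second = new RecoveringRegions();
        second.mark("region-1");
        checkThenClear(second, failed);
        System.out.println("check-then-clear, still recovering: " + second.isRecovering("region-1"));
    }
}
```

In the first ordering, a retry that skips non-recovering regions would have nothing left to replay for `region-1`, which is the data loss scenario described above.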

> [MTTR] Improve Region Server Recovery Time - Distributed Log Replay
> -------------------------------------------------------------------
>
>                 Key: HBASE-7006
>                 URL: https://issues.apache.org/jira/browse/HBASE-7006
>             Project: HBase
>          Issue Type: New Feature
>          Components: MTTR
>            Reporter: stack
>            Assignee: Jeffrey Zhong
>            Priority: Critical
>             Fix For: 0.98.0, 0.95.1
>
>         Attachments: 7006-addendum-3.txt, LogSplitting Comparison.pdf, ProposaltoimprovelogsplittingprocessregardingtoHBASE-7006-v2.pdf,
hbase-7006-addendum.patch, hbase-7006-combined-v1.patch, hbase-7006-combined-v4.patch, hbase-7006-combined-v5.patch,
hbase-7006-combined-v6.patch, hbase-7006-combined-v7.patch, hbase-7006-combined-v8.patch,
hbase-7006-combined-v9.patch, hbase-7006-combined.patch
>
>
> Just saw an interesting issue where a cluster went down hard and 30 nodes had 1700 WALs
to replay. Replay took almost an hour. It looks like it could run faster, given that much of the
time is spent zk'ing and nn'ing.
> Putting in 0.96 so it gets a look at least.  Can always punt.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
