Date: Mon, 2 Mar 2015 23:33:04 +0000 (UTC)
From: "Jeffrey Zhong (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-13136) TestSplitLogManager.testGetPreviousRecoveryMode is flakey

    [ https://issues.apache.org/jira/browse/HBASE-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14344005#comment-14344005 ]

Jeffrey Zhong commented on HBASE-13136:
---------------------------------------

[~Apache9] There does exist a race condition. SplitLogManager has a chore (TimeoutMonitor) which creates rescan znodes, and a newly created rescan znode is what causes the flakiness. Below are the suggested changes; alternatively, we could just fix the test case to make sure there is no znode under splitLogZNode:

{code}
diff --git a/hbase-server/src/main/java/org/apache/hadoop/hbase/coordination/ZKSplitLogManagerCoordination.java b/hbase-server/src/main/java/org/apache/hadoop/hbase/coordination/ZKSplitLogManagerCoordination.java
index 694ccff..8ed4357 100644
--- a/hbase-server/src/main/java/org/apache/hadoop/hbase/coordination/ZKSplitLogManagerCoordination.java
+++ b/hbase-server/src/main/java/org/apache/hadoop/hbase/coordination/ZKSplitLogManagerCoordination.java
@@ -27,6 +27,7 @@ import static org.apache.hadoop.hbase.master.SplitLogManager.TerminationStatus.S
 import java.io.IOException;
 import java.io.InterruptedIOException;
+import java.util.ArrayList;
 import java.util.List;
 import java.util.Set;
 import java.util.concurrent.ConcurrentMap;
@@ -801,7 +802,16 @@ public class ZKSplitLogManagerCoordination extends ZooKeeperListener implements
     }
     if (previousRecoveryMode == RecoveryMode.UNKNOWN) {
       // Secondly check if there are outstanding split log task
-      List<String> tasks = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
+      List<String> tmpTasks = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
+      // Remove rescan nodes
+      List<String> tasks = new ArrayList<String>();
+      for (String tmpTask : tmpTasks) {
+        String znodePath = ZKUtil.joinZNode(watcher.splitLogZNode, tmpTask);
+        if (ZKSplitLog.isRescanNode(watcher, znodePath)) {
+          continue;
+        }
+        tasks.add(tmpTask);
+      }
       if (tasks != null && !tasks.isEmpty()) {
{code}
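A minimal sketch of that second option (fixing the test rather than the coordination code) is shown below. It is only an illustration under stated assumptions: the helper name, the polling loop, and the timeout are hypothetical, while ZKUtil.listChildrenNoWatch and watcher.splitLogZNode are the same calls used in the patch above, and the method is assumed to live in TestSplitLogManager where ZKUtil and ZooKeeperWatcher are already imported.

{code:title=Sketch of the test-side alternative (illustrative only)}
// Hypothetical helper: make the test wait until no znode is left under
// splitLogZNode before it checks the recovery mode, so a rescan znode created
// by the TimeoutMonitor chore cannot leak into the check.
private void waitForEmptySplitLogZNode(ZooKeeperWatcher watcher, long timeoutMs)
    throws Exception {
  long deadline = System.currentTimeMillis() + timeoutMs;
  while (System.currentTimeMillis() < deadline) {
    List<String> children = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
    if (children == null || children.isEmpty()) {
      return; // nothing outstanding under splitLogZNode
    }
    Thread.sleep(100);
  }
  throw new AssertionError("splitLogZNode still has children after " + timeoutMs + " ms");
}
{code}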
> TestSplitLogManager.testGetPreviousRecoveryMode is flakey
> ----------------------------------------------------------
>
>                 Key: HBASE-13136
>                 URL: https://issues.apache.org/jira/browse/HBASE-13136
>             Project: HBase
>          Issue Type: Bug
>            Reporter: zhangduo
>
> Add test code to run it 100 times, then we can make it fail consistently.
> {code:title=TestSplitLogManager.java}
> @Test
> public void test() throws Exception {
>   for (int i = 0; i < 100; i++) {
>     setup();
>     testGetPreviousRecoveryMode();
>     teardown();
>   }
> }
> {code}
> And then add some ugly debug logs (yeah, I usually debug this way...)
> {code:title=ZKSplitLogManagerCoordination.java}
> @Override
> public void setRecoveryMode(boolean isForInitialization) throws IOException {
>   synchronized(this) {
>     if (this.isDrainingDone) {
>       // when there is no outstanding splitlogtask after master start up, we already have up to
>       // date recovery mode
>       return;
>     }
>   }
>   if (this.watcher == null) {
>     // when watcher is null(testing code) and recovery mode can only be LOG_SPLITTING
>     synchronized(this) {
>       this.isDrainingDone = true;
>       this.recoveryMode = RecoveryMode.LOG_SPLITTING;
>     }
>     return;
>   }
>   boolean hasSplitLogTask = false;
>   boolean hasRecoveringRegions = false;
>   RecoveryMode previousRecoveryMode = RecoveryMode.UNKNOWN;
>   RecoveryMode recoveryModeInConfig =
>       (isDistributedLogReplay(conf)) ? RecoveryMode.LOG_REPLAY : RecoveryMode.LOG_SPLITTING;
>   // Firstly check if there are outstanding recovering regions
>   try {
>     List<String> regions = ZKUtil.listChildrenNoWatch(watcher, watcher.recoveringRegionsZNode);
>     LOG.debug("=======" + regions);
>     if (regions != null && !regions.isEmpty()) {
>       hasRecoveringRegions = true;
>       previousRecoveryMode = RecoveryMode.LOG_REPLAY;
>     }
>     if (previousRecoveryMode == RecoveryMode.UNKNOWN) {
>       // Secondly check if there are outstanding split log task
>       List<String> tasks = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
>       LOG.debug("=======" + tasks);
>       if (tasks != null && !tasks.isEmpty()) {
>         hasSplitLogTask = true;
>         if (isForInitialization) {
>           // during initialization, try to get recovery mode from splitlogtask
>           int listSize = tasks.size();
>           for (int i = 0; i < listSize; i++) {
>             String task = tasks.get(i);
>             try {
>               byte[] data =
>                   ZKUtil.getData(this.watcher, ZKUtil.joinZNode(watcher.splitLogZNode, task));
>               if (data == null) continue;
>               SplitLogTask slt = SplitLogTask.parseFrom(data);
>               previousRecoveryMode = slt.getMode();
>               if (previousRecoveryMode == RecoveryMode.UNKNOWN) {
>                 // created by old code base where we don't set recovery mode in splitlogtask
>                 // we can safely set to LOG_SPLITTING because we're in master initialization code
>                 // before SSH is enabled & there is no outstanding recovering regions
>                 previousRecoveryMode = RecoveryMode.LOG_SPLITTING;
>               }
>               break;
>             } catch (DeserializationException e) {
>               LOG.warn("Failed parse data for znode " + task, e);
>             } catch (InterruptedException e) {
>               throw new InterruptedIOException();
>             }
>           }
>         }
>       }
>     }
>   } catch (KeeperException e) {
>     throw new IOException(e);
>   }
>   synchronized (this) {
>     if (this.isDrainingDone) {
>       return;
>     }
>     if (!hasSplitLogTask && !hasRecoveringRegions) {
>       this.isDrainingDone = true;
>       LOG.debug("====set to " + recoveryModeInConfig);
>       this.recoveryMode = recoveryModeInConfig;
>       return;
>     } else if (!isForInitialization) {
>       // splitlogtask hasn't drained yet, keep existing recovery mode
>       return;
>     }
>     if (previousRecoveryMode != RecoveryMode.UNKNOWN) {
>       LOG.debug("====set to " + previousRecoveryMode);
>       this.isDrainingDone = (previousRecoveryMode == recoveryModeInConfig);
>       this.recoveryMode = previousRecoveryMode;
>     } else {
>       LOG.debug("====set to " + recoveryModeInConfig);
>       this.recoveryMode = recoveryModeInConfig;
>     }
>   }
> }
> {code}
> When failing, I got this:
> {noformat}
> 2015-03-02 12:26:12,555 INFO [main] master.TestSplitLogManager(654): testGetPreviousRecoveryMode
> 2015-03-02 12:26:12,600 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(388): split-log-manager-tests465da653-db4c-475b-bf73-91ac0ca13323-0x14bd8bbf5d90000, quorum=localhost:53589, baseZNode=/hbase Received ZooKeeper Event, type=None, state=Disconnected, path=null
> 2015-03-02 12:26:12,600 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(477): split-log-manager-tests465da653-db4c-475b-bf73-91ac0ca13323-0x14bd8bbf5d90000, quorum=localhost:53589, baseZNode=/hbase Received Disconnected from ZooKeeper, ignoring
> 2015-03-02 12:26:12,621 DEBUG [main] coordination.ZKSplitLogManagerCoordination(875): Distributed log replay=true
> 2015-03-02 12:26:12,626 DEBUG [main] coordination.ZKSplitLogManagerCoordination(798): =======null
> 2015-03-02 12:26:12,627 DEBUG [main] coordination.ZKSplitLogManagerCoordination(806): =======[testRecovery]
> 2015-03-02 12:26:12,628 DEBUG [main] coordination.ZKSplitLogManagerCoordination(855): ====set to LOG_SPLITTING
> 2015-03-02 12:26:12,629 INFO [main] coordination.ZKSplitLogManagerCoordination(594): found orphan task testRecovery
> 2015-03-02 12:26:12,630 INFO [main] coordination.ZKSplitLogManagerCoordination(598): Found 1 orphan tasks and 0 rescan nodes
> 2015-03-02 12:26:12,631 DEBUG [main-EventThread] coordination.ZKSplitLogManagerCoordination(464): task not yet acquired /hbase/splitWAL/testRecovery ver = 0
> 2015-03-02 12:26:12,631 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(548): creating orphan task /hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,631 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(178): resubmitting unassigned orphan task /hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,631 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(229): resubmitting task /hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,632 INFO [main] master.TestSplitLogManager(665): Mode1=LOG_SPLITTING
> 2015-03-02 12:26:12,633 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(388): split-log-manager-tests9317eabf-735c-4a9d-bc5b-8a9269bb6d62-0x14bd8bbf8be0000, quorum=localhost:55209, baseZNode=/hbase Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,639 INFO [main] master.TestSplitLogManager(668): Mode2=LOG_SPLITTING
> 2015-03-02 12:26:12,640 DEBUG [main] coordination.ZKSplitLogManagerCoordination(875): Distributed log replay=true
> 2015-03-02 12:26:12,641 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$GetDataAsyncCallback(1000): task znode /hbase/splitWAL/testRecovery vanished or not created yet.
> 2015-03-02 12:26:12,641 DEBUG [main] coordination.ZKSplitLogManagerCoordination(798): =======null
> 2015-03-02 12:26:12,642 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(472): task /hbase/splitWAL/RESCAN0000000001 entered state: DONE dummy-master,1,1
> 2015-03-02 12:26:12,642 DEBUG [main] coordination.ZKSplitLogManagerCoordination(806): =======[RESCAN0000000001]
> 2015-03-02 12:26:12,642 INFO [main] master.TestSplitLogManager(670): Mode3=LOG_SPLITTING
> 2015-03-02 12:26:12,642 INFO [main] hbase.ChoreService(303): Chore service for: dummy-master,1,1_splitLogManager_ had [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@35cd95e8] on shutdown
> 2015-03-02 12:26:12,643 INFO [main] zookeeper.MiniZooKeeperCluster(244): Shutdown MiniZK cluster with all ZK servers
> {noformat}
> We got some tasks here, which causes hasSplitLogTask to be true.
> {code:title=}
>   List<String> tasks = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
>   LOG.debug("=======" + tasks);
>   if (tasks != null && !tasks.isEmpty()) {
>     hasSplitLogTask = true;
> {code}
> So we skipped the mode setting stage and returned.
> Not sure if it is a race condition in ZKSplitLogManagerCoordination or just a unit test issue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
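As a side note on the analysis above: the rescan znode (RESCAN0000000001 in the log) is exactly what ends up in tasks, which is why hasSplitLogTask flips to true. The filtering idea from the proposed patch can be read in isolation as a small helper like the one below. This is only a restatement for clarity, not part of the patch: the helper name is made up, and it assumes the same imports as ZKSplitLogManagerCoordination (ZKUtil, ZKSplitLog, ZooKeeperWatcher, KeeperException), while the calls themselves (listChildrenNoWatch, joinZNode, isRescanNode) are the ones used in the patch.

{code:title=Sketch: counting split log tasks without rescan nodes (illustrative only)}
// Hypothetical helper restating the patch's idea: list the children of
// splitLogZNode and skip rescan nodes (created by the TimeoutMonitor chore)
// so that only real split log tasks are counted when deciding the previous
// recovery mode.
private static List<String> listSplitLogTasksIgnoringRescan(ZooKeeperWatcher watcher)
    throws KeeperException {
  List<String> tasks = new ArrayList<String>();
  List<String> children = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
  if (children == null) {
    return tasks;
  }
  for (String child : children) {
    String znodePath = ZKUtil.joinZNode(watcher.splitLogZNode, child);
    if (!ZKSplitLog.isRescanNode(watcher, znodePath)) {
      tasks.add(child);
    }
  }
  return tasks;
}
{code}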