Date: Sat, 7 Mar 2015 07:34:39 +0000 (UTC)
From: "zhangduo (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Updated] (HBASE-13136) TestSplitLogManager.testGetPreviousRecoveryMode is flakey

     [ https://issues.apache.org/jira/browse/HBASE-13136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhangduo updated HBASE-13136:
-----------------------------
    Attachment: HBASE-13136-0.98.patch

Patch for 0.98. 1.0 can use the same patch as master.

> TestSplitLogManager.testGetPreviousRecoveryMode is flakey
> ----------------------------------------------------------
>
>                 Key: HBASE-13136
>                 URL: https://issues.apache.org/jira/browse/HBASE-13136
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.1.0
>            Reporter: zhangduo
>            Assignee: zhangduo
>         Attachments: HBASE-13136-0.98.patch, HBASE-13136.patch
>
>
> Add test code to run it 100 times, then we can make it fail every time.
> {code:title=TestSplitLogManager.java}
>   @Test
>   public void test() throws Exception {
>     for (int i = 0; i < 100; i++) {
>       setup();
>       testGetPreviousRecoveryMode();
>       teardown();
>     }
>   }
> {code}
> And then add some ugly debug logs (yeah, I usually debug in this way...)
> {code:title=ZKSplitLogManagerCoordination.java}
>   @Override
>   public void setRecoveryMode(boolean isForInitialization) throws IOException {
>     synchronized (this) {
>       if (this.isDrainingDone) {
>         // when there is no outstanding splitlogtask after master start up, we already have up to
>         // date recovery mode
>         return;
>       }
>     }
>     if (this.watcher == null) {
>       // when watcher is null(testing code) and recovery mode can only be LOG_SPLITTING
>       synchronized (this) {
>         this.isDrainingDone = true;
>         this.recoveryMode = RecoveryMode.LOG_SPLITTING;
>       }
>       return;
>     }
>     boolean hasSplitLogTask = false;
>     boolean hasRecoveringRegions = false;
>     RecoveryMode previousRecoveryMode = RecoveryMode.UNKNOWN;
>     RecoveryMode recoveryModeInConfig =
>         (isDistributedLogReplay(conf)) ? RecoveryMode.LOG_REPLAY : RecoveryMode.LOG_SPLITTING;
>     // Firstly check if there are outstanding recovering regions
>     try {
>       List<String> regions = ZKUtil.listChildrenNoWatch(watcher, watcher.recoveringRegionsZNode);
>       LOG.debug("=======" + regions);
>       if (regions != null && !regions.isEmpty()) {
>         hasRecoveringRegions = true;
>         previousRecoveryMode = RecoveryMode.LOG_REPLAY;
>       }
>       if (previousRecoveryMode == RecoveryMode.UNKNOWN) {
>         // Secondly check if there are outstanding split log task
>         List<String> tasks = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
>         LOG.debug("=======" + tasks);
>         if (tasks != null && !tasks.isEmpty()) {
>           hasSplitLogTask = true;
>           if (isForInitialization) {
>             // during initialization, try to get recovery mode from splitlogtask
>             int listSize = tasks.size();
>             for (int i = 0; i < listSize; i++) {
>               String task = tasks.get(i);
>               try {
>                 byte[] data =
>                     ZKUtil.getData(this.watcher, ZKUtil.joinZNode(watcher.splitLogZNode, task));
>                 if (data == null) continue;
>                 SplitLogTask slt = SplitLogTask.parseFrom(data);
>                 previousRecoveryMode = slt.getMode();
>                 if (previousRecoveryMode == RecoveryMode.UNKNOWN) {
>                   // created by old code base where we don't set recovery mode in splitlogtask
>                   // we can safely set to LOG_SPLITTING because we're in master initialization code
>                   // before SSH is enabled & there is no outstanding recovering regions
>                   previousRecoveryMode = RecoveryMode.LOG_SPLITTING;
>                 }
>                 break;
>               } catch (DeserializationException e) {
>                 LOG.warn("Failed parse data for znode " + task, e);
>               } catch (InterruptedException e) {
>                 throw new InterruptedIOException();
>               }
>             }
>           }
>         }
>       }
>     } catch (KeeperException e) {
>       throw new IOException(e);
>     }
>     synchronized (this) {
>       if (this.isDrainingDone) {
>         return;
>       }
>       if (!hasSplitLogTask && !hasRecoveringRegions) {
>         this.isDrainingDone = true;
>         LOG.debug("====set to " + recoveryModeInConfig);
>         this.recoveryMode = recoveryModeInConfig;
>         return;
>       } else if (!isForInitialization) {
>         // splitlogtask hasn't drained yet, keep existing recovery mode
>         return;
>       }
>       if (previousRecoveryMode != RecoveryMode.UNKNOWN) {
>         LOG.debug("====set to " + previousRecoveryMode);
>         this.isDrainingDone = (previousRecoveryMode == recoveryModeInConfig);
>         this.recoveryMode = previousRecoveryMode;
>       } else {
>         LOG.debug("====set to " + recoveryModeInConfig);
>         this.recoveryMode = recoveryModeInConfig;
>       }
>     }
>   }
> {code}
> When failing, I got this:
> {noformat}
> 2015-03-02 12:26:12,555 INFO [main] master.TestSplitLogManager(654): testGetPreviousRecoveryMode
> 2015-03-02 12:26:12,600 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(388): split-log-manager-tests465da653-db4c-475b-bf73-91ac0ca13323-0x14bd8bbf5d90000, quorum=localhost:53589, baseZNode=/hbase Received ZooKeeper Event, type=None, state=Disconnected, path=null
> 2015-03-02 12:26:12,600 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(477): split-log-manager-tests465da653-db4c-475b-bf73-91ac0ca13323-0x14bd8bbf5d90000, quorum=localhost:53589, baseZNode=/hbase Received Disconnected from ZooKeeper, ignoring
> 2015-03-02 12:26:12,621 DEBUG [main] coordination.ZKSplitLogManagerCoordination(875): Distributed log replay=true
> 2015-03-02 12:26:12,626 DEBUG [main] coordination.ZKSplitLogManagerCoordination(798): =======null
> 2015-03-02 12:26:12,627 DEBUG [main] coordination.ZKSplitLogManagerCoordination(806): =======[testRecovery]
> 2015-03-02 12:26:12,628 DEBUG [main] coordination.ZKSplitLogManagerCoordination(855): ====set to LOG_SPLITTING
> 2015-03-02 12:26:12,629 INFO [main] coordination.ZKSplitLogManagerCoordination(594): found orphan task testRecovery
> 2015-03-02 12:26:12,630 INFO [main] coordination.ZKSplitLogManagerCoordination(598): Found 1 orphan tasks and 0 rescan nodes
> 2015-03-02 12:26:12,631 DEBUG [main-EventThread] coordination.ZKSplitLogManagerCoordination(464): task not yet acquired /hbase/splitWAL/testRecovery ver = 0
> 2015-03-02 12:26:12,631 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(548): creating orphan task /hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,631 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(178): resubmitting unassigned orphan task /hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,631 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(229): resubmitting task /hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,632 INFO [main] master.TestSplitLogManager(665): Mode1=LOG_SPLITTING
> 2015-03-02 12:26:12,633 DEBUG [main-EventThread] zookeeper.ZooKeeperWatcher(388): split-log-manager-tests9317eabf-735c-4a9d-bc5b-8a9269bb6d62-0x14bd8bbf8be0000, quorum=localhost:55209, baseZNode=/hbase Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/splitWAL/testRecovery
> 2015-03-02 12:26:12,639 INFO [main] master.TestSplitLogManager(668): Mode2=LOG_SPLITTING
> 2015-03-02 12:26:12,640 DEBUG [main] coordination.ZKSplitLogManagerCoordination(875): Distributed log replay=true
> 2015-03-02 12:26:12,641 WARN [main-EventThread] coordination.ZKSplitLogManagerCoordination$GetDataAsyncCallback(1000): task znode /hbase/splitWAL/testRecovery vanished or not created yet.
> 2015-03-02 12:26:12,641 DEBUG [main] coordination.ZKSplitLogManagerCoordination(798): =======null
> 2015-03-02 12:26:12,642 INFO [main-EventThread] coordination.ZKSplitLogManagerCoordination(472): task /hbase/splitWAL/RESCAN0000000001 entered state: DONE dummy-master,1,1
> 2015-03-02 12:26:12,642 DEBUG [main] coordination.ZKSplitLogManagerCoordination(806): =======[RESCAN0000000001]
> 2015-03-02 12:26:12,642 INFO [main] master.TestSplitLogManager(670): Mode3=LOG_SPLITTING
> 2015-03-02 12:26:12,642 INFO [main] hbase.ChoreService(303): Chore service for: dummy-master,1,1_splitLogManager_ had [java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@35cd95e8] on shutdown
> 2015-03-02 12:26:12,643 INFO [main] zookeeper.MiniZooKeeperCluster(244): Shutdown MiniZK cluster with all ZK servers
> {noformat}
> We got some tasks here, which causes hasSplitLogTask to be true:
> {code}
>         List<String> tasks = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
>         LOG.debug("=======" + tasks);
>         if (tasks != null && !tasks.isEmpty()) {
>           hasSplitLogTask = true;
> {code}
> So we skipped the mode-setting stage and returned.
> Not sure if it is a race condition in ZKSplitLogManagerCoordination or just a unit test issue.
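For illustration only, and not the attached patch: if the problem is a task left over from the previous iteration, one way the test could guard against it is to wait for the splitlog znode to drain before running testGetPreviousRecoveryMode() again. The sketch below reuses ZKUtil.listChildrenNoWatch and watcher.splitLogZNode exactly as the debug code above does; the class name, method name, and timeout handling are invented here.

{code:title=SplitLogTestUtil.java (hypothetical sketch)}
import java.util.List;

import org.apache.hadoop.hbase.zookeeper.ZKUtil;
import org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher;
import org.apache.zookeeper.KeeperException;

public class SplitLogTestUtil {
  /**
   * Hypothetical helper, not part of the attached patch: block until the splitlog
   * znode has no children, so a task left over from a previous test iteration
   * cannot make hasSplitLogTask true and short-circuit the recovery mode switch.
   */
  public static void waitForSplitLogZNodeToDrain(ZooKeeperWatcher watcher, long timeoutMs)
      throws KeeperException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      // same call the debug log above uses to list the splitlog znode's children
      List<String> tasks = ZKUtil.listChildrenNoWatch(watcher, watcher.splitLogZNode);
      if (tasks == null || tasks.isEmpty()) {
        return; // nothing outstanding, safe to start the next iteration
      }
      Thread.sleep(100);
    }
    throw new IllegalStateException("splitlog znode did not drain within " + timeoutMs + " ms");
  }
}
{code}

Calling something like this from teardown() in the 100-iteration repro above would let each iteration start from an empty splitlog znode, which should help tell a real coordination race from a test-isolation problem.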