Date: Thu, 21 May 2015 20:47:17 +0000 (UTC)
From: "Stephen Yuan Jiang (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-13732) TestHBaseFsck#testParallelWithRetriesHbck fails intermittently

[ https://issues.apache.org/jira/browse/HBASE-13732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555012#comment-14555012 ]

Stephen Yuan Jiang commented on HBASE-13732:
--------------------------------------------

The failed test TEST-org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat.xml has nothing to do with the patch.
> TestHBaseFsck#testParallelWithRetriesHbck fails intermittently
> --------------------------------------------------------------
>
>                 Key: HBASE-13732
>                 URL: https://issues.apache.org/jira/browse/HBASE-13732
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck, test
>    Affects Versions: 2.0.0, 1.1.0, 1.2.0
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>            Priority: Minor
>             Fix For: 2.0.0, 1.2.0, 1.1.1
>
>         Attachments: HBASE-13732.patch
>
>
> TestHBaseFsck#testParallelWithRetriesHbck fails intermittently (especially in the Windows environment) with "java.io.IOException: Duplicate hbck - Abort":
> {noformat}
> java.util.concurrent.ExecutionException: java.io.IOException: Duplicate hbck - Abort
>     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:111)
>     at org.apache.hadoop.hbase.util.TestHBaseFsck.testParallelWithRetriesHbck(TestHBaseFsck.java:644)
> Caused by: java.io.IOException: Duplicate hbck - Abort
>     at org.apache.hadoop.hbase.util.HBaseFsck.connect(HBaseFsck.java:484)
>     at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.doFsck(HbckTestingUtil.java:53)
>     at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.doFsck(HbckTestingUtil.java:43)
>     at org.apache.hadoop.hbase.util.hbck.HbckTestingUtil.doFsck(HbckTestingUtil.java:38)
>     at org.apache.hadoop.hbase.util.TestHBaseFsck$2RunHbck.call(TestHBaseFsck.java:635)
>     at org.apache.hadoop.hbase.util.TestHBaseFsck$2RunHbck.call(TestHBaseFsck.java:628)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:722)
> {noformat}
> HBASE-13591 tried to address this issue.
> It did improve the pass rate in the Linux environment (after the fix, I could not reproduce the failure on my machine); however, the test still failed intermittently in the Windows environment during testing of the 1.1 release.
> Looking at the code, it uses ExponentialBackoffPolicy between retries: it sleeps 200ms after the first failed attempt to acquire the lock in ZK, then 400ms, then 800ms, and so on. Therefore, even if the first hbck run completes, the second hbck run would still fail because of the long sleep time.
> The proposal to fix the problem is to use ExponentialBackoffPolicyWithLimit and cap the maximum sleep time at a small value (e.g. 5 seconds; the cap should be configurable). This would make the test more robust.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
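The retry schedule described in the issue, and the effect of the proposed cap, can be illustrated with a small standalone Java sketch. It does not reproduce HBase's actual ExponentialBackoffPolicy or ExponentialBackoffPolicyWithLimit classes; the class and method names (BackoffSketch, exponentialSleepMs, schedule, totalMs) are hypothetical. The 200 ms base and the 5-second cap come from the issue text, while the retry count of 8 is an assumption chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (hypothetical names, not HBase code): compares an
 * uncapped exponential backoff schedule with one capped at a maximum sleep.
 */
public class BackoffSketch {

    /** Sleep before retry {@code attempt} (0-based): baseMs * 2^attempt. */
    static long exponentialSleepMs(long baseMs, int attempt) {
        // Clamp the shift so a large attempt count cannot overflow the long.
        return baseMs << Math.min(attempt, 30);
    }

    /** Sleep times for {@code retries} attempts, never exceeding maxMs. */
    static List<Long> schedule(long baseMs, int retries, long maxMs) {
        List<Long> sleeps = new ArrayList<>();
        for (int attempt = 0; attempt < retries; attempt++) {
            sleeps.add(Math.min(exponentialSleepMs(baseMs, attempt), maxMs));
        }
        return sleeps;
    }

    /** Total time spent sleeping across all retries. */
    static long totalMs(List<Long> sleeps) {
        long total = 0;
        for (long sleep : sleeps) {
            total += sleep;
        }
        return total;
    }

    public static void main(String[] args) {
        long baseMs = 200;   // first sleep, from the issue description
        long capMs = 5_000;  // example cap suggested in the issue
        int retries = 8;     // assumption: retry count chosen for illustration

        // Long.MAX_VALUE as the cap means "no cap": plain exponential backoff.
        List<Long> uncapped = schedule(baseMs, retries, Long.MAX_VALUE);
        List<Long> capped = schedule(baseMs, retries, capMs);

        System.out.println("uncapped: " + uncapped + " total=" + totalMs(uncapped) + " ms");
        System.out.println("capped:   " + capped + " total=" + totalMs(capped) + " ms");
    }
}
```

With these assumed values, the uncapped schedule reaches 25,600 ms before the last retry and sleeps 51,000 ms in total, while the capped schedule never sleeps longer than 5,000 ms and totals 21,200 ms. Capping the sleep bounds how long a waiting hbck run can keep sleeping after the lock has already been released, which appears to be the point of the proposed fix.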