Date: Thu, 23 Apr 2015 15:05:38 +0000 (UTC)
From: "Hudson (JIRA)"
To: issues@hbase.apache.org
Subject: [jira] [Commented] (HBASE-13526) TestRegionServerReportForDuty can be flaky: hang or timeout

    [ https://issues.apache.org/jira/browse/HBASE-13526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509184#comment-14509184 ]

Hudson commented on HBASE-13526:
--------------------------------

FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #914 (See [https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/914/])
HBASE-13526 TestRegionServerReportForDuty can be flaky: hang or timeout (jerryjch: rev 868027a50051db2e20e5fb9f2babd782793a646b)
* hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestRegionServerReportForDuty.java

> TestRegionServerReportForDuty can be flaky: hang or timeout
> -----------------------------------------------------------
>
>                 Key: HBASE-13526
>                 URL: https://issues.apache.org/jira/browse/HBASE-13526
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>   Affects Versions: 2.0.0, 1.1.0, 0.98.12
>            Reporter: Jerry He
>            Assignee: Jerry He
>            Priority: Minor
>             Fix For: 2.0.0, 1.1.0, 0.98.13, 1.0.2, 1.2.0
>
>         Attachments: HBASE-13526.patch
>
>
> This test case is from HBASE-13317.
> The test uses a custom region server to simulate reportForDuty in a master failover case. The custom RS starts, then the primary master fails, and the custom RS then calls reportForDuty on the second master after the failover.
> The test occasionally hangs or times out.
> The root cause is that during initialization the first master assigns meta (and creates and assigns the namespace table). Meta may be assigned to the custom RS, which has started (it has placed an rs node in ZK) but will not actually check in and come online. The master then goes through multiple reassignments, which can be lengthy and cause trouble.
> There are a couple of issues I see in the master assignment code:
> 1. The master puts all region servers obtained from the ZK rs node into the online server list, including those that have not checked in via RPC. Meta and other regions are then assigned based on the whole list.
> 2. When one assign plan fails, we don't exclude the failed server when picking the next destination, which may prolong the assignment process.
> I will provide a patch to fix the test case. The other issues mentioned are up for discussion.
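
The two assignment issues listed in the description can be illustrated with a minimal sketch. This is not HBase's assignment code: the class DestinationPicker and all of its methods are hypothetical names invented for illustration. It assumes the master can distinguish servers that only have a ZK node from servers that have also checked in via RPC, and it shows the idea of skipping both unchecked servers and servers where an earlier plan failed.

```java
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch only; not part of HBase and not its real API. */
final class DestinationPicker {
    // Servers that have placed a node in ZK (they may not have checked in yet).
    private final Set<String> registeredInZk = new LinkedHashSet<>();
    // Servers that have also completed reportForDuty over RPC.
    private final Set<String> checkedIn = new LinkedHashSet<>();

    /** Records a server that is visible in ZK only. */
    void registerInZk(String server) {
        registeredInZk.add(server);
    }

    /** Records a server that has checked in via RPC (implies ZK registration here). */
    void reportForDuty(String server) {
        registeredInZk.add(server);
        checkedIn.add(server);
    }

    /**
     * Picks the first server that has checked in and is not in failedServers.
     * Returns an empty Optional when no such server exists.
     */
    Optional<String> pickDestination(Set<String> failedServers) {
        for (String server : registeredInZk) {
            boolean online = checkedIn.contains(server);       // addresses issue 1
            boolean failed = failedServers.contains(server);   // addresses issue 2
            if (online && !failed) {
                return Optional.of(server);
            }
        }
        return Optional.empty();
    }
}
```

In this sketch a caller would add a server to failedServers each time an assign plan to it fails, so the next call to pickDestination does not return the same server again.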