From: Julian Zhou
Date: Fri, 02 Aug 2013 22:20:30 +0800
To: user@hbase.apache.org, dev@hbase.apache.org
Subject: Long waiting loop for "Waiting for region servers count to settle" when doing hmaster failover

Hi Community,

I hit this issue while testing on 0.94.3. The cluster has 1 active hmaster, 1 backup hmaster, and 4 region servers. I ran a YCSB workload on it to load data. While the workload was running, I manually killed the active hmaster with kill -9. The backup master seemed to take over the active role quickly, but then it looped on:

"
INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for xxx ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for xxx ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
... ... ... ...
INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 2, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 3, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 4, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
"

The waiting message above always repeats for about 5 to 7 minutes before the region servers check in to the new active master. Then, after this long wait, all 4 region servers suddenly check in successfully.

Any idea what causes this waiting loop? Thanks a lot for the advice~

-- 
Best Regards,
Julian
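
For reference, here is one possible reading of the parameters printed in the log message, written as a small Java sketch. It is only a guess at a "wait until the count settles" rule and is not taken from the HBase ServerManager source; the class name, method names, and the rule itself (in particular, that the wait cannot end while fewer than the minimum number of servers have checked in, whatever the timeout says) are assumptions made for illustration. Under that assumption, a master that sees 0 checked-in servers keeps logging past the 4500 ms timeout, which would be consistent with the behaviour described above.

```java
/** Illustrative only: an assumed "wait until the count settles" rule, not HBase source. */
final class SettleWaitSketch {

    /**
     * Returns true if the (hypothetical) master should keep waiting.
     * Parameter names mirror the values printed in the log line.
     */
    static boolean keepWaiting(int checkedIn, long sleptMs, long sinceLastChangeMs,
                               int min, int max, long timeoutMs, long intervalMs) {
        if (checkedIn >= max) return false;     // nobody else can be expected
        if (checkedIn < min) return true;       // assumption: below the minimum, the timeout is ignored
        if (sleptMs < timeoutMs) return true;   // still inside the overall timeout
        return sinceLastChangeMs < intervalMs;  // count changed too recently to call it settled
    }

    /**
     * Replays a sequence of observed counts, one per polling tick (the last
     * value repeats once the array runs out; the array must be non-empty).
     * Returns the milliseconds slept when the wait ends, or -1 if it never
     * ends within maxTicks.
     */
    static long simulate(int[] counts, int min, int max, long timeoutMs,
                         long intervalMs, long tickMs, int maxTicks) {
        long slept = 0;
        long lastChange = 0;
        int prev = 0;
        for (int tick = 0; tick < maxTicks; tick++) {
            int count = counts[Math.min(tick, counts.length - 1)];
            if (count != prev) {                // remember when the count last moved
                lastChange = slept;
                prev = count;
            }
            if (!keepWaiting(count, slept, slept - lastChange,
                             min, max, timeoutMs, intervalMs)) {
                return slept;
            }
            slept += tickMs;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Three polls with no region server, then all 4 check in at once.
        int[] counts = {0, 0, 0, 4};
        System.out.println(simulate(counts, 1, Integer.MAX_VALUE, 4500, 1500, 1500, 100));
    }
}
```

In this sketch the wait only ends one full interval after the count stops changing, and a count that stays at 0 never ends it at all. Whether the real master behaves this way on 0.94.3, and why the region servers take minutes to check in, is exactly the open question.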