Message-ID: <309594460.1233000839717.JavaMail.jira@brutus>
Date: Mon, 26 Jan 2009 12:13:59 -0800 (PST)
From: "Jim Kellerman (JIRA)"
To: hbase-dev@hadoop.apache.org
Reply-To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-1123) Server never leaves the dead list though logs have all been processed if crashed server had -ROOT- (seemingly)
In-Reply-To: <901846260.1231825021566.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/HBASE-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667384#action_12667384 ]

Jim Kellerman commented on HBASE-1123:
--------------------------------------

On the hbase-0.19 branch, I could not reproduce this. I killed the server holding root while the cluster was under load, and it exited the waiting state in 1:01 (min:secs):

{code}
2009-01-26 19:26:32,266 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region -ROOT-,,0 to server 208.76.44.141:8020
2009-01-26 19:41:08,396 INFO org.apache.hadoop.hbase.master.ServerManager: 208.76.44.141:8020 lease expired
2009-01-26 19:42:10,757 DEBUG org.apache.hadoop.hbase.master.RegionServerOperation: Removed 208.76.44.141:8020 from deadservers Map
{code}

I then waited for the cluster to rebalance, put it under load again, and killed the server holding the root region. It took a little longer (2 min 19 sec) before the server was removed from the dead list.

{code}
2009-01-26 19:41:11,808 INFO org.apache.hadoop.hbase.master.RegionManager: assigning region -ROOT-,,0 to server 208.76.44.139:8020
2009-01-26 19:49:01,966 INFO org.apache.hadoop.hbase.master.ServerManager: 208.76.44.139:8020 lease expired
2009-01-26 19:51:20,354 DEBUG org.apache.hadoop.hbase.master.RegionServerOperation: Removed 208.76.44.139:8020 from deadservers Map
{code}

However, if leases included the start code, we could have put the restarted server back into service much sooner, as it would not interfere with the splitting of logs (which include the start code in their name).

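To make the start-code idea concrete, here is a minimal sketch of a dead list keyed by address plus start code rather than by address alone. It is not the ServerManager code: the class DeadServerList, its methods, and the restarted server's start code are all invented for illustration. The only value taken from the report is 1231717984112, which appears in the crashed server's log directory name.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: this is not HBase's ServerManager, and all names are invented.
final class DeadServerList {
    // Keys are "address,startCode", so each server incarnation is tracked separately.
    private final Set<String> dead = ConcurrentHashMap.newKeySet();

    private static String key(String address, long startCode) {
        return address + "," + startCode;
    }

    // Called when the lease of a server incarnation expires.
    void leaseExpired(String address, long startCode) {
        dead.add(key(address, startCode));
    }

    // Called once the logs of that incarnation have been split.
    void logsSplit(String address, long startCode) {
        dead.remove(key(address, startCode));
    }

    // A report-for-duty request only has to wait if this exact incarnation is dead.
    boolean mustWait(String address, long startCode) {
        return dead.contains(key(address, startCode));
    }

    public static void main(String[] args) {
        DeadServerList list = new DeadServerList();
        String address = "XX.XX.XX.142:60020";
        long crashedStartCode = 1231717984112L;   // from the log directory name in the report
        long restartedStartCode = 1231800000000L; // made-up value for the restarted process

        list.leaseExpired(address, crashedStartCode);
        System.out.println(list.mustWait(address, crashedStartCode));
        System.out.println(list.mustWait(address, restartedStartCode));
    }
}
```

With this keying, a restarted process on the same host and port carries a new start code, so its report-for-duty request would not be held up while the old incarnation's logs are still being split.
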
> Server never leaves the dead list though logs have all been processed if crashed server had -ROOT- (seemingly)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1123
>                 URL: https://issues.apache.org/jira/browse/HBASE-1123
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>             Fix For: 0.20.0
>
>         Attachments: 1123.patch
>
>
> Cluster is just hung after host that had -ROOT- completed splitting its logs... old server is just stuck on the dead list and never comes off it.
> {code}
> ..
> 2009-01-13 01:09:36,448 [HMaster] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Splitting 6 of 6: hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020/hlog.dat.1231718928939
> 2009-01-13 01:09:37,396 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020 removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:38,591 [HMaster] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for path hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/712889985/oldlogfile.log and region TestTable,0040922294,1231559109829
> 2009-01-13 01:09:38,670 [HMaster] DEBUG org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for path hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/484208094/oldlogfile.log and region TestTable,0042007133,1231628296909
> 2009-01-13 01:09:45,096 [HMaster] INFO org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020
> 2009-01-13 01:09:47,317 [SocketListener0-2] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location serverXX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:47,416 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020 removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:47,518 [IPC Server handler 3 on 60000] INFO org.apache.hadoop.hbase.master.RegionManager: assigning region -ROOT-,,0 to server XX.XX.XX141:60020
> 2009-01-13 01:09:49,007 [IPC Server handler 6 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Total Load: 430, Num Servers: 3, Avg Load: 144.0
> 2009-01-13 01:09:50,219 [SocketListener0-0] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server XX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_PROCESS_OPEN: -ROOT-,,0 from XX.XX.XX.141:60020
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: -ROOT-,,0 from 208.76.44.141:60020
> 2009-01-13 01:09:50,719 [SocketListener0-3] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server XX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:50,967 [SocketListener0-4] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location serverXX.XX.XX.142:60020, location region name .META.,,1
> 2009-01-13 01:09:52,117 [SocketListener0-5] DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server XX.XX.XX.142:60020, location region name .META.,,1
> ....
> 2009-01-13 01:09:57,426 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 removal from dead list before processing report-for-duty request
> ....
> 2009-01-13 01:10:45,156 [HMaster] DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessServerShutdown of XX.XX.XX142:60020
> 2009-01-13 01:10:45,156 [HMaster] INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server XX.XX.XX.142:60020: logSplit: true, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2009-01-13 01:10:45,156 [HMaster] DEBUG org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process server shutdown scanning root region on XX.XX.XX.141
> 2009-01-13 01:10:45,182 [HMaster] DEBUG org.apache.hadoop.hbase.master.RegionServerOperation: process server shutdown scanning root region on XX.XX.XX.141 finished HMaster
> 2009-01-13 01:10:45,183 [HMaster] DEBUG org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanMetaRegions: process server shutdown scanning .META.,,1 on XX.XX.XX.142:60020
> 2009-01-13 01:10:47,496 [IPC Server handler 4 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 removal from dead list before processing report-for-duty request
> 2009-01-13 01:10:49,320 [IPC Server handler 8 on 60000] DEBUG org.apache.hadoop.hbase.master.ServerManager: Total Load: 431, Num Servers: 3, Avg Load: 144.0
> .....
> {code}
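
The log directory in the quoted report, log_XX.XX.XX.142_1231717984112_60020, appears to follow the layout log_<host>_<start code>_<port>. The sketch below pulls those three fields apart. The layout is inferred from that single name, and the class LogDirName and its parse method are hypothetical helpers, not part of HBase.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits a log directory name into host, start code and port.
// The layout log_<host>_<startCode>_<port> is an assumption based on one name in the report.
final class LogDirName {
    private static final Pattern LAYOUT = Pattern.compile("log_(.+)_(\\d+)_(\\d+)");

    final String host;
    final long startCode;
    final int port;

    private LogDirName(String host, long startCode, int port) {
        this.host = host;
        this.startCode = startCode;
        this.port = port;
    }

    // Parses a directory name, rejecting anything that does not match the assumed layout.
    static LogDirName parse(String name) {
        Matcher m = LAYOUT.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected log directory name: " + name);
        }
        return new LogDirName(m.group(1), Long.parseLong(m.group(2)), Integer.parseInt(m.group(3)));
    }

    public static void main(String[] args) {
        LogDirName dir = LogDirName.parse("log_XX.XX.XX.142_1231717984112_60020");
        System.out.println(dir.host + " " + dir.startCode + " " + dir.port);
    }
}
```

Because the start code is part of the directory name, the logs of a crashed incarnation and those of a restarted process on the same host and port would live in different directories.
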