Date: Sun, 2 Aug 2009 17:47:14 -0700 (PDT)
From: "stack (JIRA)"
To: hbase-dev@hadoop.apache.org
Subject: [jira] Commented: (HBASE-1736) If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)

[
https://issues.apache.org/jira/browse/HBASE-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738162#action_12738162 ]

stack commented on HBASE-1736:
------------------------------

So, I notice that the RS has a watcher on master. We got this:

{code}
2009-08-01 19:29:38,018 [main-EventThread] INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: NodeDeleted, path: /hbase/master
{code}

... but all we do is reset the watcher:

{code}
} else if (type == EventType.NodeDeleted) {
  watchMasterAddress();
{code}

We should set a flag that stops splitting -- take out the CompactSplitThread#lock -- until we get NodeCreated (NodeCreated does getMaster() ... could release lock too...). Holding the lock would hold up the CompactSplitThread... it does compactions too... probably not what's wanted.

> If RS can't talk to master, pause; more importantly, don't split (Currently we do and splits are lost and table is wounded)
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1736
>                 URL: https://issues.apache.org/jira/browse/HBASE-1736
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.20.1
>
>
> What I saw was master shutting itself down because it had lost zk lease. Fine. The RS though doesn't look like it can deal with this situation.
> We'll see stuff like this:
> {code}
> ...failed on connection exception: java.net.ConnectException: Connection refused
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:744)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:722)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:328)
>         at $Proxy0.regionServerReport(Unknown Source)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:470)
>         at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:305)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:826)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:707)
>         ... 4 more
> {code}
> ... all over the regionserver as it tries to send heartbeat to master on this broken connection.
> On split, we close parent, add children to the catalog but then when we try to tell the master about the split, it fails. Means the children never get deployed. Meantime the parent is offline.
> This issue is about going through the regionserver and anytime it has a connection to master, make sure on fault that no damage is done to the table and then that the regionserver puts a pause on splitting.
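
For illustration only, here is a rough sketch of the "flag that stops splitting" idea from the comment, kept separate from CompactSplitThread#lock so that compactions are not held up. SplitGate, onMasterEvent and trySplit are hypothetical names, not HBase code, and the local EventType enum is just a stand-in for the ZooKeeper event types seen in the log; how this would actually be wired into HRegionServer is an assumption.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch, not HBase code: refuses splits while the master znode is gone. */
final class SplitGate {
    /** Local stand-in for the two ZooKeeper event types discussed in the comment. */
    enum EventType { NodeCreated, NodeDeleted }

    // True while the master znode is believed to exist.
    private final AtomicBoolean masterAvailable = new AtomicBoolean(true);

    /** Would be called from the watcher callback, next to re-arming the watch. */
    void onMasterEvent(EventType type) {
        if (type == EventType.NodeDeleted) {
            masterAvailable.set(false);   // master gone: pause splitting
        } else if (type == EventType.NodeCreated) {
            masterAvailable.set(true);    // master back: allow splits again
        }
    }

    /**
     * Runs the split only if the master is believed to be up.
     * Returns false when the split was skipped. This is a check-then-act
     * guard: it narrows the window in which a split can be lost, but a
     * master that dies mid-split is not covered.
     */
    boolean trySplit(Runnable split) {
        if (!masterAvailable.get()) {
            return false;
        }
        split.run();
        return true;
    }
}
```

Because only the flag is consulted, a compaction path that never calls trySplit keeps running while the master is away; a skipped split would presumably just be retried on a later pass once NodeCreated arrives.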