Subject: [CDH3U0] Cluster not processing region server failover
From: Alex Romanovsky
To: user@hbase.apache.org
Date: Wed, 27 Apr 2011 14:26:39 +0400

Hi,

I am testing failover cases on a small 3-node fully distributed cluster with the following topology:

- master node: NameNode, JobTracker, QuorumPeerMain, HMaster;
- slave nodes: DataNode, TaskTracker, QuorumPeerMain, HRegionServer.

ROOT and META are initially served by two different nodes.

I create a table 'incr' with a single column family 'value' and run

    put 'incr', '00000000', 'value:main', '00000000'

to get an 8-byte counter cell whose content is still human-readable. Then I start calling

    incr 'incr', '00000000', 'value:main', 1

once every second or two.

Then I kill -9 one of my region servers, the one that serves 'incr'. The subsequent shell incr times out. I terminate it with Ctrl-C, launch the HBase shell again and repeat the command, and get the following message repeated several times:

    11/04/27 13:57:43 INFO ipc.HbaseRPC: Server at regionserver1/10.50.3.68:60020 could not be reached after 1 tries, giving up.

Running tail on the master log yields the following diagnostics:

    2011-04-27 14:08:32,982 INFO org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in 0ms. Moving 1 regions off of 1 overloaded servers onto 1 less loaded servers
    2011-04-27 14:08:32,982 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7., src=regionserver1,60020,1303895356068, dest=regionserver2,60020,1303898049443
    2011-04-27 14:08:32,982 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7. (offlining)
    2011-04-27 14:08:32,982 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region incr,,1303892996561.cf314a59d3a5c79a77153f82b40015d7. but it is not currently assigned anywhere

hbase hbck finds 2 inconsistencies (regionserver1 down, region not served). hbase hbck -fix reports 2 initial inconsistencies and 1 eventual inconsistency, and migrates the region to a live region server.

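As a side note on the counter cell used in this test: the value '00000000' works as a counter because it is exactly 8 bytes long. The sketch below illustrates the arithmetic, assuming incr reads the stored bytes as a big-endian signed 64-bit long (as HBase's Bytes.toLong does). The helper name counter_after_increment is hypothetical and only mimics the encoding; it does not talk to HBase.

```python
import struct


def counter_after_increment(cell: bytes, amount: int) -> bytes:
    """Decode an 8-byte cell as a big-endian signed long, add amount, re-encode.

    Hypothetical helper: it models only the byte-level arithmetic,
    not the HBase client or server code path.
    """
    if len(cell) != 8:
        raise ValueError("counter cells must be exactly 8 bytes")
    (value,) = struct.unpack(">q", cell)
    return struct.pack(">q", value + amount)


cell = b"00000000"  # eight ASCII '0' characters, i.e. eight 0x30 bytes
print(struct.unpack(">q", cell)[0])  # 3472328296227680304

# Three increments by 1, as in the test loop described above.
for _ in range(3):
    cell = counter_after_increment(cell, 1)
print(cell.decode("ascii"))  # 00000003
```

Under this assumption the cell stays readable only for the first nine increments: the tenth turns the last byte into 0x3A, which is ':' in ASCII.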
However, when I repeat the test with regionserver2 and regionserver1 swapped (i.e. kill -9 the region server process on regionserver2, the initial evacuation target), hbase hbck -fix throws

    org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy interface org.apache.hadoop.hbase.ipc.HRegionInterface to regionserver2/10.50.3.68:60020 after attempts=1
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1008)
        at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:172)
        at org.apache.hadoop.hbase.util.HBaseFsck.getMetaEntries(HBaseFsck.java:746)
        at org.apache.hadoop.hbase.util.HBaseFsck.doWork(HBaseFsck.java:133)
        at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:989)

zookeeper.session.timeout is set to 1000 ms (i.e. 1 second), and the configuration is consistent across the cluster, so neither of these should be the cause.

Manual region reassignment also helps, but only the first time. Subsequent retries leave the 'incr' regions not assigned anywhere, and I cannot even query the table's regions from the client, since HTable instances fail to connect.

As soon as I restart the killed region server, cluster operation resumes. However, as far as I understand the HBase book, this is not the intended behavior: the cluster should automatically move regions from dead region servers to live ones.

I run the cluster on RH 5 with Sun JDK 1.6.0_24. JAVA_HOME=/usr/java/jdk1.6.0_24 is set in hadoop-env.sh (I wonder whether I should duplicate the assignment in hbase-env.sh).

Is this one of the issues known to be fixed in 0.90.2 or later releases? I searched Jira and found no matching issues; the failover scenarios mentioned there are far more complex.

What other logs or config files should I check and/or post here?

Regards,
Alex Romanovsky