From: "James Kennedy (JIRA)"
To: issues@hbase.apache.org
Date: Tue, 25 Jan 2011 19:42:43 -0500 (EST)
Subject: [jira] Commented: (HBASE-3478) HBase fails to recover from failed DNS resolution of stale meta connection info
Message-ID: <30023764.205161296002563613.JavaMail.jira@thor>
In-Reply-To: <16087130.205121296002203969.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/HBASE-3478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986792#action_12986792 ]

James Kennedy commented on HBASE-3478:
--------------------------------------

Configuration A:

    hbase.rootdir                          hdfs://localhost:8701/hbase
    hbase.master.port                      60010
    hbase.regionserver.port                60020
    hbase.zookeeper.property.clientPort    60030
    hbase.regionserver.msginterval         100
        Interval between messages from the RegionServer to HMaster in milliseconds. Default is 15 sec. Set this value low if you want unit tests to be responsive.
    hbase.client.pause                     100

Configuration B:

    hbase.rootdir                          hdfs://localhost:7701/hbase
    hbase.master.port                      7801
    hbase.regionserver.port                7802

> HBase fails to recover from failed DNS resolution of stale meta connection info
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-3478
>                 URL: https://issues.apache.org/jira/browse/HBASE-3478
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.1
>            Reporter: James Kennedy
>             Fix For: 0.90.1
>
>
> This looks like a variant of HBASE-3445:
> One of our developers ran a seed program with configuration A to generate some test data on his local machine. He then moved that data into a development environment on the same machine with a different hbase configuration B.
> On startup the HMaster waits for new regionserver to register itself:
> [25/01/11 15:37:25] 162161 [  HRegionServer] INFO  ase.regionserver.HRegionServer  - Telling master at 10.0.1.4:7801 that we are up
> [25/01/11 15:37:25] 162165 [ice-EventThread] DEBUG .hadoop.hbase.zookeeper.ZKUtil  - master:7801-0x12dbf879abe0000 Retrieved 13 byte(s) of data from znode /hbase/rs/10.0.1.4,7802,1295998613814 and set watcher; 10.0.1.4:7802
> Then ROOT region comes online at the right place: 10.0.1.4,7802
> [25/01/11 15:37:31] 168369 [yTasks:70236052] INFO  ase.catalog.RootLocationEditor  - Setting ROOT region location in ZooKeeper as 10.0.1.4:7802
> 3:57 [25/01/11 15:37:31] 168408 [10.0.1.4:7801-0] DEBUG er.handler.OpenedRegionHandler  - Opened region -ROOT-,,0.70236052 on 10.0.1.4,7802,1295998613814
> But then HMaster chokes on the stale META region location.
> [25/01/11 15:37:31] 168448 [        HMaster] ERROR he.hadoop.hbase.HServerAddress  - Could not resolve the DNS name of warren:60020
> [25/01/11 15:37:31] 168448 [        HMaster] FATAL he.hadoop.hbase.master.HMaster  - Unhandled exception. Starting shutdown.
> java.lang.IllegalArgumentException: Could not resolve the DNS name of warren:60020
>     at org.apache.hadoop.hbase.HServerAddress.checkBindAddressCanBeResolved(HServerAddress.java:105)
>     at org.apache.hadoop.hbase.HServerAddress.<init>(HServerAddress.java:66)
>     at org.apache.hadoop.hbase.catalog.MetaReader.readLocation(MetaReader.java:344)
>     at org.apache.hadoop.hbase.catalog.MetaReader.readMetaLocation(MetaReader.java:281)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:280)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:482)
>     at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:435)
>     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:382)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)
>     at java.lang.Thread.run(Thread.java:680)
>
> First of all, we do not yet understand why the RegionInfo resolved to "warren:60020" under configuration A, whereas under configuration B we get "10.0.1.4:7802". The port numbers make sense, but the "warren" hostname does not. It is probably specific to Warren's Mac environment somehow, because no other developer hits this problem when doing the same thing. "warren" isn't in his hosts file, so that remains a mystery.
>
> But irrespective of that, since the ports differ, we expect the stale meta connection data to cause a connection failure anyway, perhaps in the form of a SocketTimeoutException as in HBASE-3445.
>
> But shouldn't the HMaster handle that by catching the exception and letting verifyMetaRegionLocation() fail, so that meta regions get reassigned to the new region server?
>
> Probably the safeguards in CatalogTracker.getCachedConnection() should move up to getMetaServerConnection() so that they also encompass MetaReader.readMetaLocation(). Essentially, if getMetaServerConnection() encounters ANY exception while connecting to the meta RegionServer, it should probably just return null to force meta region reassignment.
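
To make the proposal above concrete, here is a minimal, self-contained sketch of the "return null on ANY exception" idea. It is not the HBase code and not a patch: the class MetaLocationSketch and the methods parseAndResolve() and locationOrNull() are hypothetical names invented for illustration. parseAndResolve() only imitates the kind of check that throws in the stack trace (an unresolvable host:port becomes an IllegalArgumentException), and locationOrNull() stands in for a guarded getMetaServerConnection() that treats any failure as "location unknown".

```java
import java.net.InetSocketAddress;
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in HBase.
final class MetaLocationSketch {

    // Imitates the failing check: turn "host:port" into a resolved address,
    // or throw IllegalArgumentException if it cannot be parsed or resolved.
    static InetSocketAddress parseAndResolve(String hostAndPort) {
        int idx = hostAndPort.lastIndexOf(':');
        if (idx < 0) {
            throw new IllegalArgumentException("Not a host:port pair: " + hostAndPort);
        }
        String host = hostAndPort.substring(0, idx);
        int port = Integer.parseInt(hostAndPort.substring(idx + 1));
        InetSocketAddress addr = new InetSocketAddress(host, port);
        if (addr.isUnresolved()) {
            throw new IllegalArgumentException(
                "Could not resolve the DNS name of " + hostAndPort);
        }
        return addr;
    }

    // Guarded lookup: any exception raised while reading or resolving the
    // stored location is swallowed and reported as null, so the caller can
    // treat the meta location as invalid and reassign instead of aborting.
    static InetSocketAddress locationOrNull(Callable<InetSocketAddress> reader) {
        try {
            return reader.call();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // A stale entry that fails the way the trace above does.
        InetSocketAddress stale = locationOrNull(() -> {
            throw new IllegalArgumentException(
                "Could not resolve the DNS name of warren:60020");
        });
        // An IP literal needs no DNS lookup, so it resolves locally.
        InetSocketAddress live = locationOrNull(() -> parseAndResolve("127.0.0.1:7802"));
        System.out.println("stale=" + stale + " live=" + live);
    }
}
```

The design point is only that the guard wraps the location read as well as the connection attempt, so a bad hostname and a dead port take the same "return null" path. Whether swallowing every exception type is appropriate in the real CatalogTracker is a separate question this sketch does not answer.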