From: "Brush,Ryan"
To: "user@hbase.apache.org"
Date: Fri, 1 Apr 2011 12:27:30 -0500
Subject: Re: NoRouteToHostException causes Master abort when the RegionServer hosting ROOT is not available

I've verified this was indeed caused by HBASE-3660, and the fix for it resolved the issue in our environment. Thanks!

On 4/1/11 10:57 AM, "Stack" wrote:

>The below looks like HBASE-3660, 'HMaster will exit when starting with
>stale data in cached locations such as -ROOT- or .META.', included in
>0.90.2 RC.
>St.Ack
>
>On Fri, Apr 1, 2011 at 8:48 AM, Brush,Ryan wrote:
>> This happens in similar conditions but is distinct from HBASE-3617.
>> When the region server hosting ROOT isn't available during restart, the
>> NoRouteToHostException propagates all the way up the call stack and
>> causes the master to abort. It looks like this can be addressed by
>> handling NoRouteToHostException at some point and considering that
>> node/region server offline.
>>
>> I applied the patch from HBASE-3617 and it didn't fix the problem I'm
>> seeing, which I expected given the stack trace below. Assuming this
>> reasoning is correct, does this merit a separate JIRA? It does seem
>> critical in that the failure of a single node is preventing us from
>> bringing up our cluster.
>>
>> 2011-04-01 10:15:19,472 INFO org.apache.hadoop.hbase.master.ServerManager: Exiting wait on regionserver(s) to checkin; count=2, stopped=false, count of regions out on cluster=0
>> 2011-04-01 10:15:19,486 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://iphadoop01:9000/hbase/.logs/iphadoop03.northamerica.cerner.net,60020,1301665635981 belongs to an existing region server
>> 2011-04-01 10:15:19,486 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://iphadoop01:9000/hbase/.logs/iphadoop05.northamerica.cerner.net,60020,1301665659785 belongs to an existing region server
>> 2011-04-01 10:15:22,508 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
>> java.net.NoRouteToHostException: No route to host
>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>     at $Proxy6.getProtocolVersion(Unknown Source)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
>>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:954)
>>     at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:385)
>>     at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRootServerConnection(CatalogTracker.java:211)
>>     at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:458)
>>     at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:425)
>>     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:383)
>>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:278)
>> 2011-04-01 10:15:22,510 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
>> 2011-04-01 10:15:22,510 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads
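
The approach suggested in the quoted message (handle NoRouteToHostException and treat that region server as offline) can be sketched roughly as follows. This is only an illustration under stated assumptions: it is not the HBASE-3660 patch and not HBase code. The names CatalogLocationCheck, ConnectionAttempt, and verifyLocation are hypothetical, and the only real APIs used are the standard java.io and java.net exception types.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

/** Hypothetical sketch; the class and method names are not part of HBase. */
final class CatalogLocationCheck {

    /** Hypothetical stand-in for the RPC that connects to the server hosting -ROOT-. */
    @FunctionalInterface
    interface ConnectionAttempt {
        void connect() throws IOException;
    }

    private CatalogLocationCheck() {}

    /**
     * Returns true if the connection attempt succeeds. Returns false if it
     * fails with an exception that suggests the host is down or unreachable,
     * so the caller can treat the server as offline instead of aborting.
     * Any other IOException is still propagated.
     */
    static boolean verifyLocation(ConnectionAttempt attempt) throws IOException {
        try {
            attempt.connect();
            return true;
        } catch (NoRouteToHostException | ConnectException | SocketTimeoutException e) {
            // Assumption: these three failure types all mean "server not available".
            System.err.println("Treating server as offline: " + e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the failure from the stack trace instead of opening a real socket.
        boolean valid = verifyLocation(() -> {
            throw new NoRouteToHostException("No route to host");
        });
        System.out.println("cached location valid: " + valid);
    }
}
```

With a check like this, a caller could discard the stale cached location and reassign the region when verifyLocation returns false, rather than letting the exception reach the top of the call stack.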