Subject: Re: RegionServer not dying, and Master not removing RegionServer
From: Jean-Daniel Cryans
To: user@hbase.apache.org
Date: Mon, 27 Jun 2011 10:58:39 -0700

Did you jstack the region server before killing it? We could see what
was still living, although logs might also give us a clue. As long as
the client ZK thread is alive, the znode will stay up.

About the last message:

> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: No HServerInfo found
> for H3S3,60020,1308946657608

That's usually given when you have a DNS issue, where the hostname and
FQDN are mixed up. See http://hbase.apache.org/book/os.html#dns

J-D

On Mon, Jun 27, 2011 at 10:45 AM, Matt Davies wrote:
> All,
>
> You may have seen my previous email regarding Master node crashing, but upon
> further research we may have other issues.
>
> I went out to the RegionServer (H3S3) and verified that the process was
> running. There was no meaningful log output, and the last output was from
> 48 hours ago indicating a log roll.
>
> *Previous State:*
> Master died (H3M1)
>
> *Corrective Actions:*
> I restarted the master out on H3M1, and it started by trying to recover logs
> out on H3S3.
> There were many entries like
>
> 2011-06-27 11:36:25,639 WARN org.apache.hadoop.hbase.util.FSUtils: Waited
> 1701352ms for lease recovery on
> hdfs://H3M1:8020/hbase/.logs/H3S3,60020,1308946657608/H3S3%3A60020.1309015083129:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> failed to create file
> /hbase/.logs/H3S3,60020,1308946657608/H3S3%3A60020.1309015083129 for
> DFSClient_hb_m_H3M1:60000_1309194478697 on client 10.x.x.x, because this
> file is already being created by
> DFSClient_hb_rs_H3S3,60020,1308946657608_1308946657819 on 10.5.241.203
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1196)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1284)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:596)
> at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410)
>
>
> Then we decided to kill -9 the regionserver process out on H3S3, and
> received the following in the master:
>
> 2011-06-27 11:36:26,006 INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
> ephemeral node deleted, processing expiration [H3S3,60020,1308946657608]
> 2011-06-27 11:36:26,006 INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: No HServerInfo found
> for H3S3,60020,1308946657608
>
>
> Ultimately, the master came up and transitioned the regions to a different
> regionserver.
>
> Is there a situation where the regionserver may become unresponsive yet the
> zookeeper client portion of the process can still check in and tickle
> zookeeper so the master thinks it alive? I think this may be why, in my
> previous post, the master tried to assign to H3S3 and died in the attempt.
>
>
> Thanks!
> -Matt
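
On the hostname/FQDN point in the reply at the top: one quick way to see what a node reports for itself is to compare the name the JVM returns for the local host with its canonical (reverse-resolved) name. The following is only a minimal sketch under stated assumptions. DnsCheck and namesAgree are hypothetical names and are not part of HBase, and this is not how HBase itself performs the lookup. It only uses the standard java.net.InetAddress calls.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

/** Hypothetical helper (not part of HBase): compares the names a host reports for itself. */
final class DnsCheck {

    /** Lower-cases a name and drops one trailing dot so "H3S3.example.com." matches "h3s3.example.com". */
    static String normalize(String name) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        return n.endsWith(".") ? n.substring(0, n.length() - 1) : n;
    }

    /** True when both names are the same after normalization. */
    static boolean namesAgree(String hostName, String canonicalName) {
        return normalize(hostName).equals(normalize(canonicalName));
    }

    public static void main(String[] args) {
        try {
            InetAddress local = InetAddress.getLocalHost();
            String hostName = local.getHostName();            // name the OS reports
            String canonical = local.getCanonicalHostName();  // name from a reverse lookup
            System.out.println("hostname:  " + hostName);
            System.out.println("canonical: " + canonical);
            System.out.println("address:   " + local.getHostAddress());
            if (!namesAgree(hostName, canonical)) {
                // A short name on one side and an FQDN on the other is the kind of mix-up described above.
                System.out.println("MISMATCH: hostname and canonical name differ");
            }
        } catch (UnknownHostException e) {
            System.out.println("local hostname does not resolve: " + e.getMessage());
        }
    }
}
```

Running it on the master and on each region server and comparing the output should show whether one node uses the short name (for example H3S3) while another resolves it to a fully qualified name.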