From: Matthew LeMieux
Subject: Region servers up and running, but Master reports 0
Date: Mon, 23 Aug 2010 16:22:33 -0700
Message-Id: <10947895-9D0E-455A-A1AF-602020E88122@mlogiciels.com>
To: user@hbase.apache.org

I have a cluster of 3 machines where the NameNode is separate from the HMaster based on the distribution from Cloudera (CDH3). I have been running it successfully for a couple weeks. As of this morning, it is completely unusable. I'm looking for some help on how to fix it. Details are below. Thank you.

This morning I found HBase to be unresponsive, and tried restarting it. That didn't help. For example, running "hbase shell", and then "list" hangs.

The master and region processes start up, but the master does not recognize that the region servers are there.
I am getting the following in the master's log file:

2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN

Meanwhile, the region servers show this in their log files:

2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zookeeper:2181
2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zookeeper:2181, initiating session
2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zookeeper:2181, sessionid = 0x12aa0cc2520000e, negotiated timeout = 40000
2010-08-23 23:05:21,172 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: None, path: null
2010-08-23 23:05:21,177 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
2010-08-23 23:05:21,421 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got master:60000
2010-08-23 23:05:21,421 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at master:60000 that we are up
2010-08-23 23:05:22,056 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020

The region server process is obviously waiting on something:

/tmp/hbaselog$ sudo strace -p7592
Process 7592 attached - interrupt to quit
futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL

The Master isn't idle; it appears to be trying to do some sort of recovery, having woken up to find 0 region servers. Here is an excerpt from its log:

2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12142 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435, length=1150
2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=3 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12143 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451, length=448
2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=2 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12144 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468, length=582
2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468

It looks like the Master is sequentially going through logs up to 143261, having started at 1, and is currently at 12144. At the current rate, it will take around 12 hours to complete. Do I have to wait for it to complete before the master will recognize the region servers? If it doesn't have any region servers, then what the heck is the master doing anyway?

Thank you for your help,
Matthew
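
For anyone wanting to redo the time estimate from their own master log, here is a rough sketch of the extrapolation. It assumes the split rate stays constant between the first and last "Splitting hlog N of M" lines it is given, which a three-line excerpt cannot really confirm. The function name estimate_remaining and the regular expression are illustrative only and are not part of HBase.

```python
import re
from datetime import datetime

# Captures the log4j timestamp and the "Splitting hlog N of M" counters.
SPLIT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*Splitting hlog (\d+) of (\d+)"
)


def estimate_remaining(log_lines):
    """Return a timedelta estimating how long the remaining splits will take.

    Hypothetical helper: uses only the first and last matching lines and
    assumes a constant per-log split time between them.
    """
    samples = []
    for line in log_lines:
        match = SPLIT_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            samples.append((stamp, int(match.group(2)), int(match.group(3))))
    if len(samples) < 2:
        raise ValueError("need at least two 'Splitting hlog' lines")

    (t0, n0, _), (t1, n1, total) = samples[0], samples[-1]
    if n1 <= n0:
        raise ValueError("hlog counter did not advance between samples")

    per_log = (t1 - t0) / (n1 - n0)   # average time per split log
    return per_log * (total - n1)     # logs numbered after the last one seen


if __name__ == "__main__":
    # Prefixes of two lines from the master log excerpt above.
    sample = [
        "2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12142 of 143261:",
        "2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12144 of 143261:",
    ]
    print("estimated time remaining:", estimate_remaining(sample))
```

Feeding it a longer window of the log (minutes rather than a fraction of a second) should give a more trustworthy rate than the short excerpt does.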