hbase-user mailing list archives

From Matthew LeMieux <...@mlogiciels.com>
Subject Region servers up and running, but Master reports 0
Date Mon, 23 Aug 2010 23:22:33 GMT
I have a cluster of 3 machines running the Cloudera distribution (CDH3), with the NameNode
on a separate machine from the HMaster. It ran successfully for a couple of weeks, but as
of this morning it is completely unusable. I'm looking for some help on how to fix it.
Details are below. Thank you.

This morning I found HBase unresponsive and tried restarting it, which didn't help.
For example, running "hbase shell" and then "list" hangs.

The master and region server processes start up, but the master does not recognize that the
region servers are there. I am getting the following in the master's log file:

2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN
2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers,
0 dead, average load NaN


Meanwhile, the region servers show this in their log files: 

2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to
server zookeeper:2181
2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to zookeeper:2181, initiating session
2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete
on server zookeeper:2181, sessionid = 0x12aa0cc2520000e, negotiated timeout = 40000
2010-08-23 23:05:21,172 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
event, state: SyncConnected, type: None, path: null
2010-08-23 23:05:21,177 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher
on master address ZNode /hbase/master
2010-08-23 23:05:21,421 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode
/hbase/master got master:60000
2010-08-23 23:05:21,421 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master
at master:60000 that we are up
2010-08-23 23:05:22,056 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed
shutdown hook thread: Shutdownhook:regionserver60020

The Region server process is obviously waiting on something: 

/tmp/hbaselog$ sudo strace -p7592
Process 7592 attached - interrupt to quit
futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL

The Master isn't idle. It appears to be attempting some sort of recovery, having woken up
to find 0 region servers. Here is an excerpt from its log:

2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog
12142 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435,
length=1150
2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover
attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=3 entries
from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog
12143 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451,
length=448
2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover
attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=2 entries
from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog
12144 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468,
length=582
2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468


It looks like the Master is going through the logs sequentially, from 1 up to 143261, and is
currently at 12144. At the current rate, it will take around 12 hours to complete.
Do I have to wait for it to complete before the master will recognize the region servers?
If it doesn't have any region servers, then what the heck is the master doing anyway?
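
As a rough check on that estimate, here is a minimal sketch that extrapolates from the three
"Splitting hlog N of M" lines in the excerpt above. It assumes the wrapped log lines have been
joined back onto one line each and that the split rate stays constant. The function names
(parse, estimate_remaining_hours) are made up for illustration and are not part of HBase.

```python
import re
from datetime import datetime

# "Splitting hlog" lines taken from the master log excerpt, with each
# wrapped line rejoined and the HDFS path dropped (only the timestamp
# and the two counters are needed for the estimate).
LOG_LINES = [
    "2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12142 of 143261",
    "2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12143 of 143261",
    "2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12144 of 143261",
]

# Group 1: date and time, group 2: current log index, group 3: total logs.
PATTERN = re.compile(r"^(\S+ \S+) .*Splitting hlog (\d+) of (\d+)")


def parse(line):
    """Return (timestamp, current, total) for a split line, else None."""
    match = PATTERN.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return stamp, int(match.group(2)), int(match.group(3))


def estimate_remaining_hours(lines):
    """Extrapolate the remaining split time from the first and last sample."""
    samples = [parsed for parsed in map(parse, lines) if parsed is not None]
    (t_first, n_first, _), (t_last, n_last, total) = samples[0], samples[-1]
    seconds_per_log = (t_last - t_first).total_seconds() / (n_last - n_first)
    return (total - n_last) * seconds_per_log / 3600


if __name__ == "__main__":
    print(f"estimated hours remaining: {estimate_remaining_hours(LOG_LINES):.1f}")
```

With only these three samples, the extrapolation comes to a little under 8 hours remaining.
The sampled files are all tiny (448 to 1150 bytes), so larger logs later in the sequence could
easily push the real figure higher.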

Thank you for your help, 

Matthew


