hbase-user mailing list archives

From "Jorn Argelo - Ephorus" <Jorn.Arg...@ephorus.com>
Subject HBase Master not picking up dead regionserver
Date Fri, 16 Sep 2011 09:31:15 GMT
Hi all,

 

I'm testing our small cluster running the CDH3U1 version of Hadoop /
HBase. When I stop a regionserver (either cleanly or by killing it hard),
the HBase master does not detect that the regionserver is dead. If I do
this to the regionserver hosting the META region, the entire cluster
becomes unusable, because the HBase master does not move the META region
to another regionserver. It simply keeps trying to reconnect to the dead
regionserver and stays in that state indefinitely. Here's a snapshot of
the error in the HBase master log (for the record, datanode03 is the one
that is dead):

 

 

2011-09-16 11:22:12,514 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing
plan for region ephorus_test,
/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7.;
plan=hri=ephorus_test,
/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7.,
src=datanode01.dev.ephorus-labs.com,60020,1316078209570,
dest=datanode03.dev.ephorus-labs.com,60020,1316162005809

2011-09-16 11:22:12,514 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Assigning region
ephorus_test,
/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7. to
datanode03.dev.ephorus-labs.com,60020,1316162005809

2011-09-16 11:22:12,514 WARN
org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for
region 05f13ffa2ec18aac9ffa6f79a23c12b2 from server
datanode02.dev.ephorus-labs.com,60020,1316078218061 but region was in
the state
TestTable,0009796041,1316100506914.05f13ffa2ec18aac9ffa6f79a23c12b2.
state=OPEN, ts=1316164932386 and not in expected PENDING_OPEN or OPENING
states

2011-09-16 11:22:12,514 WARN
org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
ephorus_test,
/entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7. to
serverName=datanode03.dev.ephorus-labs.com,60020,1316162005809,
load=(requests=0, regions=8, usedHeap=42, maxHeap=4083), trying to
assign elsewhere instead; retry=0

java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy6.openRegion(Unknown Source)
        at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:559)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:931)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:746)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:726)
        at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

 

It may be worth mentioning that this behavior is the same regardless of
whether the cluster is idle or loaded. Apart from that (and some infamous
stop-the-world GC issues which I got to fix) the cluster is running
fine.

 

For reference, the ZooKeeper ensemble is properly terminating the
session, as we can see here:

 

2011-09-16 10:33:25,988 - INFO  [CommitProcessor:1:NIOServerCnxn@1580] -
Established session 0x1324d1aa92a01bb with negotiated timeout 40000 for
client /10.20.4.98:47238

2011-09-16 10:33:29,180 - INFO
[ProcessThread:-1:PrepRequestProcessor@407] - Got user-level
KeeperException when processing sessionid:0x1324d1aa92a01bb type:create
cxid:0xd zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
Path:/hbase/rs/datanode03,60020,1316162005809 Error:KeeperErrorCode =
NodeExists for /hbase/rs/datanode03,60020,1316162005809

2011-09-16 10:34:06,414 - INFO
[ProcessThread:-1:PrepRequestProcessor@387] - Processed session
termination for sessionid: 0x2324dad8d770170

2011-09-16 10:34:06,430 - INFO
[ProcessThread:-1:PrepRequestProcessor@387] - Processed session
termination for sessionid: 0x1324d1aa92a01bb

2011-09-16 10:34:06,438 - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
socket connection for client /10.20.4.98:47238 which had sessionid
0x1324d1aa92a01bb
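
For anyone who wants to repeat this check on a longer log, here is a minimal sketch in Python. It assumes the ZooKeeper server log uses the "Established session" and "Processed session termination" wording shown in the excerpt above; the function name session_report is hypothetical and not part of any ZooKeeper or HBase tooling.

```python
import re

# \s+ between words lets the patterns match across the hard line wraps
# that the mail client introduced into the pasted log.
ESTABLISHED = re.compile(r"Established\s+session\s+(0x[0-9a-fA-F]+)")
TERMINATED = re.compile(
    r"Processed\s+session\s+termination\s+for\s+sessionid:\s*(0x[0-9a-fA-F]+)"
)


def session_report(log_text):
    """Map each session id established in log_text to True if a matching
    termination entry also appears in log_text, else False."""
    terminated = set(TERMINATED.findall(log_text))
    return {sid: sid in terminated for sid in ESTABLISHED.findall(log_text)}


if __name__ == "__main__":
    # Two entries copied from the excerpt above, wraps included.
    excerpt = """
    2011-09-16 10:33:25,988 - INFO  [CommitProcessor:1:NIOServerCnxn@1580] -
    Established session 0x1324d1aa92a01bb with negotiated timeout 40000 for
    client /10.20.4.98:47238
    2011-09-16 10:34:06,430 - INFO
    [ProcessThread:-1:PrepRequestProcessor@387] - Processed session
    termination for sessionid: 0x1324d1aa92a01bb
    """
    print(session_report(excerpt))
```

A session that maps to False was established but has no termination entry in the text that was scanned, which only means something if the scanned text covers the whole period of interest.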

 

I can also confirm from the zk_dump in the HBase master web UI that the
ZooKeeper ensemble no longer has the session active, yet the HBase
master does not detect this. The hbase shell also still reports that
all servers are alive:

 

hbase(main):001:0> status

3 servers, 0 dead, 96.3333 average load
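
A small sketch of how this mismatch could be checked automatically, again in Python. It assumes the shell prints a summary line in exactly the "N servers, N dead, X average load" form shown above; parse_status, master_view_is_stale and the reachable_servers argument are hypothetical names, and the count of reachable servers has to come from somewhere else (for example, counted by hand).

```python
import re

# Matches the summary line printed by the hbase shell `status` command
# in the form quoted in this message.
STATUS_LINE = re.compile(
    r"(\d+)\s+servers,\s+(\d+)\s+dead,\s+(\d+(?:\.\d+)?)\s+average\s+load"
)


def parse_status(text):
    """Extract the server counts and average load from `status` output."""
    match = STATUS_LINE.search(text)
    if match is None:
        raise ValueError("no status summary line found")
    return {
        "servers": int(match.group(1)),
        "dead": int(match.group(2)),
        "average_load": float(match.group(3)),
    }


def master_view_is_stale(text, reachable_servers):
    """Return True if the master reports more live servers than the
    caller knows to be reachable (reachable_servers is supplied by hand)."""
    return parse_status(text)["servers"] > reachable_servers


if __name__ == "__main__":
    line = "3 servers, 0 dead, 96.3333 average load"
    # With datanode03 down, only two regionservers are actually reachable.
    print(parse_status(line), master_view_is_stale(line, 2))
```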

 

Maybe I am missing something obvious, but I'm quite stumped on this. I
found a thread on Google where J-D suggested the session timeout, but
nothing happens even if I let it run overnight (12+ hours), which is far
longer than the negotiated timeout of 40000 shown in the ZooKeeper log
above. You can find the thread here:
http://apache-hbase.679495.n3.nabble.com/Can-master-detect-sudden-region-server-death-td1141384.html

 

The only way to get the HBase master to detect that the regionserver is
dead is to restart the HBase master, which is frankly not what I want.

 

Any pointers would be greatly appreciated.

 

Thanks,

Jorn

