hbase-user mailing list archives

From "Jorn Argelo - Ephorus" <Jorn.Arg...@ephorus.com>
Subject RE: HBase Master not picking up dead regionserver
Date Mon, 19 Sep 2011 07:26:14 GMT
Thanks J-D, this was indeed the problem.

Specifically, my DNS was correct, but the hosts file on the regionservers listed the
short name first and the FQDN second. As a result, each RS reported its short name to the
ZK ensemble instead of the FQDN, while the HBase master was expecting an FQDN. I have
removed this entry from the hosts file, so hostname resolution now depends entirely on
DNS (which is what we want, as that ensures consistency).
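
To illustrate why the order of names on a hosts-file line matters, here is a minimal sketch. It assumes the common resolver convention that the first name after the IP address is the canonical name and the remaining names are aliases. The function name and the two sample entries are hypothetical; the IP and host names are only borrowed from the logs quoted further down in this thread.

```python
def canonical_name(hosts_text, ip):
    """Return the first name listed for `ip` in hosts-file text, or None.

    Assumption: the first name after the address is treated as canonical;
    any further names on the line are aliases.
    """
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == ip:
            return fields[1]
    return None


# Hypothetical entries, not copied from the actual hosts file.
short_first = "10.20.4.98 datanode03 datanode03.dev.ephorus-labs.com\n"
fqdn_first = "10.20.4.98 datanode03.dev.ephorus-labs.com datanode03\n"

print(canonical_name(short_first, "10.20.4.98"))  # short name comes back
print(canonical_name(fqdn_first, "10.20.4.98"))   # FQDN comes back
```

With the short name listed first, a lookup of the machine's own address yields the short name, which would match the symptom described above.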

Robert: No, we are not using EC2 for our HBase cluster. Thanks for your pointers, though.

Regards,
Jorn

-----Original message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On behalf of Jean-Daniel Cryans
Sent: Friday, 16 September 2011 19:34
To: user@hbase.apache.org
Subject: Re: HBase Master not picking up dead regionserver

This often happens to users with a broken reverse DNS setup. Look at
the master log around the time it was supposed to process the dead node;
it should tell you that it doesn't know who that is (because the
server name it sees is different from the one registered in the
master).

One example from http://search-hadoop.com/m/CANUA1qRCkQ1

10567 2011-07-14 18:56:04,530 INFO
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
ephemeral node deleted, processing expiration
[server-2.domain.net.,60020,1310680454144]
10568 2011-07-14 18:56:04,530 INFO
org.apache.hadoop.hbase.zookeeper.RegionServerTracker: No HServerInfo
found for server-2.domain.net.,60020,1310680454144

You can see in their RS log:

2011-07-14 18:56:03,423 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server
at: server-2.domain.net,60020,1310680454144

server-2.domain.net.,60020 != server-2.domain.net,60020 (note the trailing
dot after "net" in the first name, as in the master log lines above).
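
The mismatch can be sketched as an exact comparison of the two server names taken from the log lines above. This is only an illustration, not HBase's own code: the helper names are hypothetical, and the "host,port,startcode" layout is inferred from the logs.

```python
def parse_server_name(name):
    """Split 'host,port,startcode' (layout inferred from the logs)."""
    host, port, startcode = name.rsplit(",", 2)
    return host, int(port), int(startcode)


def same_server(a, b):
    """Exact comparison: a trailing dot on the host makes the names differ."""
    return parse_server_name(a) == parse_server_name(b)


def same_server_ignoring_root_dot(a, b):
    """Comparison after stripping the trailing root dot from the host."""
    host_a, port_a, code_a = parse_server_name(a)
    host_b, port_b, code_b = parse_server_name(b)
    return (host_a.rstrip("."), port_a, code_a) == (host_b.rstrip("."), port_b, code_b)


# Names copied from the master and RS log lines quoted above.
from_master_log = "server-2.domain.net.,60020,1310680454144"
from_rs_log = "server-2.domain.net,60020,1310680454144"

print(same_server(from_master_log, from_rs_log))
print(same_server_ignoring_root_dot(from_master_log, from_rs_log))
```

The first check fails and the second succeeds, so the only difference between the two names is the trailing dot.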

J-D

On Fri, Sep 16, 2011 at 2:31 AM, Jorn Argelo - Ephorus
<Jorn.Argelo@ephorus.com> wrote:
> Hi all,
>
>
>
> I'm in the process of testing our small cluster running the CDH3U1
> version of Hadoop / HBase. The problem I'm having is that when I stop
> a regionserver (either cleanly or by killing it hard), the HBase master
> does not detect that the regionserver is dead. If I do this to the
> regionserver hosting the META region, the entire cluster becomes
> unusable because the HBase master does not move the META region to
> another regionserver. It simply keeps trying to reconnect to the dead
> regionserver forever. Here's a snapshot of the error in the HBase
> master log (for the record, datanode03 is the one that is dead):
>
>
>
>
>
> 2011-09-16 11:22:12,514 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing
> plan for region ephorus_test,
> /entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7.;
> plan=hri=ephorus_test,
> /entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7.,
> src=datanode01.dev.ephorus-labs.com,60020,1316078209570,
> dest=datanode03.dev.ephorus-labs.com,60020,1316162005809
>
> 2011-09-16 11:22:12,514 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region
> ephorus_test,
> /entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7. to
> datanode03.dev.ephorus-labs.com,60020,1316162005809
>
> 2011-09-16 11:22:12,514 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for
> region 05f13ffa2ec18aac9ffa6f79a23c12b2 from server
> datanode02.dev.ephorus-labs.com,60020,1316078218061 but region was in
> the state
> TestTable,0009796041,1316100506914.05f13ffa2ec18aac9ffa6f79a23c12b2.
> state=OPEN, ts=1316164932386 and not in expected PENDING_OPEN or OPENING
> states
>
> 2011-09-16 11:22:12,514 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
> ephorus_test,
> /entries/liberalism/,1315833925382.918c3035c5387c00e8d6589f7dce64e7. to
> serverName=datanode03.dev.ephorus-labs.com,60020,1316162005809,
> load=(requests=0, regions=8, usedHeap=42, maxHeap=4083), trying to
> assign elsewhere instead; retry=0
>
> java.net.ConnectException: Connection refused
>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy6.openRegion(Unknown Source)
>        at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:559)
>        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:931)
>        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:746)
>        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:726)
>        at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
>
>
>
> It may be worth mentioning that this behavior is the same regardless of
> whether the cluster is idle or loaded. Apart from that (and some infamous
> stop-the-world GC issues which I got to fix) the cluster is running
> fine.
>
>
>
> For reference: the zookeeper ensemble is properly terminating the
> session as we can see here:
>
>
>
> 2011-09-16 10:33:25,988 - INFO  [CommitProcessor:1:NIOServerCnxn@1580] -
> Established session 0x1324d1aa92a01bb with negotiated timeout 40000 for
> client /10.20.4.98:47238
>
> 2011-09-16 10:33:29,180 - INFO
> [ProcessThread:-1:PrepRequestProcessor@407] - Got user-level
> KeeperException when processing sessionid:0x1324d1aa92a01bb type:create
> cxid:0xd zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
> Path:/hbase/rs/datanode03,60020,1316162005809 Error:KeeperErrorCode =
> NodeExists for /hbase/rs/datanode03,60020,1316162005809
>
> 2011-09-16 10:34:06,414 - INFO
> [ProcessThread:-1:PrepRequestProcessor@387] - Processed session
> termination for sessionid: 0x2324dad8d770170
>
> 2011-09-16 10:34:06,430 - INFO
> [ProcessThread:-1:PrepRequestProcessor@387] - Processed session
> termination for sessionid: 0x1324d1aa92a01bb
>
> 2011-09-16 10:34:06,438 - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed
> socket connection for client /10.20.4.98:47238 which had sessionid
> 0x1324d1aa92a01bb
>
>
>
> I can also confirm from the zk_dump in the HBase master web UI that
> the ZooKeeper ensemble no longer has the session active, yet the HBase
> master does not detect this. The hbase shell still reports that all
> servers are alive:
>
>
>
> hbase(main):001:0> status
>
> 3 servers, 0 dead, 96.3333 average load
>
>
>
> Maybe I am missing something obvious but I'm quite stumped on this. I
> found a thread on Google where J-D suggested the session timeout, but
> nothing happens if I let it run overnight (so that is 12 hours+). You
> can find it here:
> http://apache-hbase.679495.n3.nabble.com/Can-master-detect-sudden-region-server-death-td1141384.html
>
>
>
> The only way for the HBase master to detect that the regionserver is
> dead is by restarting the HBase master ... which is frankly not really
> what I want.
>
>
>
> Any pointers would be greatly appreciated.
>
>
>
> Thanks,
>
> Jorn
>
>
