hbase-user mailing list archives

From Matthew LeMieux <...@mlogiciels.com>
Subject Re: Master does not recognize region servers - cannot restart cluster
Date Sun, 03 Oct 2010 20:41:17 GMT
Thank you so much.  That solved the problem.  

I killed the master, moved '/hbase/.logs' to another directory, and restarted the master.
I am back up and running again.  Such a simple solution; I wish I had known it earlier.  This
one was a long outage.
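
For anyone who hits this later, the "surgery" was roughly the following (just a sketch of what I ran; the backup directory name is made up, and the script paths assume a default install):

    # stop the stuck master
    bin/hbase-daemon.sh stop master
    # move the WAL directory aside rather than deleting it, so the edits could be replayed later if needed
    hadoop dfs -mv /hbase/.logs /hbase/.logs.saved-20101003
    # start the master again; the region servers checked in almost immediately
    bin/hbase-daemon.sh start master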

J-D, I do believe there is a bug underneath all this, because the master simply stopped processing
log files (see the "21 of 36" comment below).  There were no error/exception messages; it just
stopped processing logs and did nothing further.  I'm guessing that since it never finished
processing the log files, it never paid attention to the region servers.
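
(For what it's worth, the stall is easy to confirm by grepping the master log for the per-file split messages and for the completion message J-D quotes below; the log file name here is only an example from a default setup:)

    # count the per-file "Splitting hlog N of M" messages the master emitted
    grep -c "Splitting hlog" hbase-*-master-*.log
    # this line only appears if the splitting actually ran to completion
    grep "hlog file splitting completed" hbase-*-master-*.log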

I'm confident the region servers were connecting to the master for two reasons:

  1) Without touching the region servers, after performing the "surgery", everything came
back to life (almost instantly).
  2) When I killed the master, the region servers reported this:

2010-10-03 20:29:38,730 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.io.IOException: Call to /10.104.37.247:60000 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:781)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:252)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:408)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:384)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:431)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:342)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1229)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1246)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:445)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:279)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:514)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:454)


Thank you, 

Matthew


On Oct 3, 2010, at 1:25 PM, Jean-Daniel Cryans wrote:

> You should make sure that the region servers are actually trying to
> connect to the master, once the log splitting is done it should be
> instantaneous.
> 
> The logs (in a default setup) are in /hbase/.logs, just do a "hadoop
> dfs -rmr /hbase/.logs" before starting the master or move them aside
> (for future replay).
> 
> J-D
> 
> On Sun, Oct 3, 2010 at 1:22 PM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
>> It is my understanding that the log splitting going on is a process of taking recent updates and applying those changes to the appropriate regions.  Is there a way I can remove those "logs" or recent edits safely, so that the master doesn't have to do any log splitting on start up?
>> 
>> I am willing to lose a small amount of data just to be able to get back up and running.  What kind of surgery would I need to do on the file system?
>> 
>> -Matthew
>> 
>> On Oct 3, 2010, at 12:55 PM, Matthew LeMieux wrote:
>> 
>>> I probably wasn't clear on this in the previous message.  There are NO exceptions reported in the log file.  It appears as though the master is attempting to do the log splits (see below, 21 of 36 at 2010-10-03 18:13:28,187), and then immediately transitions into the "Closed" messages, again with NO exceptions being reported - it just stops doing the log splitting.  After the "Closed" messages start there are no further log splitting messages in the log file.  It just does a series of those, and then moves on to reporting 0 region servers again.  Again, NO other messages are being sent to the log file.
>>> 
>>> -Matthew
>>> 
>>> 
>>> 
>>> On Oct 3, 2010, at 12:33 PM, Matthew LeMieux wrote:
>>> 
>>>> I think it is a bit more serious than that.  Waiting doesn't appear to do much.
>>>> 
>>>> After a series of log messages that appear to be doing the splitting, the "Closed" messages appear:
>>>> 
>>>> 2010-10-03 18:13:26,664 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering
filehdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.1284820404595
>>>> 2010-10-03 18:13:26,701 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 18:13:26,872 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
lease recover attempt for hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.128482040459
>>>> 5
>>>> 2010-10-03 18:13:27,500 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Pushed=5211 entries from hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.128482040459
>>>> 5
>>>> 2010-10-03 18:13:27,500 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Splitting hlog 20 of 36: hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.128482041398
>>>> 9, length=64550958
>>>> 2010-10-03 18:13:27,500 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering
filehdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.1284820413989
>>>> 2010-10-03 18:13:27,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
lease recover attempt for hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.128482041398
>>>> 9
>>>> 2010-10-03 18:13:28,187 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Pushed=4555 entries from hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.128482041398
>>>> 9
>>>> 2010-10-03 18:13:28,187 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Splitting hlog 21 of 36: hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.1284820433464,
length=2373620705
>>>> 2010-10-03 18:13:28,187 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering
filehdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.1284820433464
>>>> 2010-10-03 18:13:28,391 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
lease recover attempt for hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.1284820433464
>>>> 2010-10-03 18:13:42,141 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Pushed=65522 entries from hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/.logs/domU-12-31-39-18-12-05.compute-1.internal,60020,1284222361677/10.104.37.247%3A60020.1284820433464
>>>> 2010-10-03 18:13:42,533 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/dirgraph/e2a8b3352d84523be7f43f966bde6ef1/recovered.edits/0000000000075378488
>>>> 2010-10-03 18:13:42,539 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/dirgraph/e3a7a269aa9f1b32de6b3add53cc5a30/recovered.edits/0000000000075400136
>>>> 2010-10-03 18:13:42,546 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/dirgraph/32d50d0d37a5ed3f89f17772678c4c20/recovered.edits/0000000000075400152
>>>> 2010-10-03 18:13:42,550 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/dirgraph/182719f8ad5c2c7da56bdd3987a4559c/recovered.edits/0000000000075471128
>>>> 2010-10-03 18:13:42,555 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/dirgraph/4148aa42cf9ac956c0e048b83ca6f186/recovered.edits/0000000000075463238
>>>> 2010-10-03 18:13:42,560 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/dirgraph/9d4edffd841fb66c37ab5944f12ef911/recovered.edits/0000000000075393635
>>>> 2
>>>> 
>>>> Then the ONLY messages that appear in the master's log file are these:
>>>> 
>>>> 2010-10-03 18:55:27,130 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 18:56:27,140 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 18:57:27,150 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 18:58:27,160 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 18:59:27,170 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:00:27,180 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:01:27,190 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:02:27,200 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:03:27,215 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:04:27,225 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:05:27,235 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:06:27,245 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:07:27,255 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:08:27,265 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:09:27,275 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:10:27,285 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:11:27,295 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:12:27,305 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:13:27,315 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:14:27,325 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:15:27,335 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:16:27,348 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:17:27,358 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:18:27,368 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:19:27,378 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:20:27,388 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:21:27,388 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:22:27,388 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:23:27,388 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:24:27,388 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:25:27,388 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 2010-10-03 19:26:27,388 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>> 
>>>> I'm not pulling any lines out; ALL region servers are just waiting for the master, with NO lines being written to the log file.  I tried killing the master and restarting; clearly the region servers are aware of the master, but the master persists in not recognizing them:
>>>> 
>>>> 2010-10-03 18:11:35,888 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
Got ZooKeeper event, state: SyncConnected, type: NodeDeleted, path: /hbase/master
>>>> 2010-10-03 18:11:35,891 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
<10.249.70.255:/hbase,domU-12-31-39-18-1B-05.compute-1.internal,60020,1286127632413>Set
watcher on master address ZNode /hbase/master
>>>> 2010-10-03 18:12:26,697 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
Got ZooKeeper event, state: SyncConnected, type: NodeCreated, path: /hbase/master
>>>> 2010-10-03 18:12:26,698 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
<10.249.70.255:/hbase,domU-12-31-39-18-1B-05.compute-1.internal,60020,1286127632413>Read
ZNode /hbase/master got 10.104.37.247:60000
>>>> 2010-10-03 18:12:26,698 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
Telling master at 10.104.37.247:60000 that we are up
>>>> 
>>>> This is clearly not a situation of just waiting it out.  A region server killed itself, and restarting the region server didn't get the cluster back up and running.  I restarted everything from scratch, and that brings us to the current state.  The machines are not doing anything; they have a load average of 0.00 and CPU idle of more than 99%.
>>>> 
>>>> What would cause the cluster to completely lock up and refuse to do anything?
>>>> 
>>>> Thank you for your help,
>>>> 
>>>> Matthew
>>>> 
>>>> On Oct 3, 2010, at 12:02 PM, Jean-Daniel Cryans wrote:
>>>> 
>>>>> From the brief looks of it, it seems that the master is splitting the
>>>>> log files from the dead region server. It will do that while the
>>>>> cluster is running and will keep answering the other region servers,
>>>>> but if you restart HBase then when the master starts it will split
>>>>> everything before starting to take region server checkins. Just let
>>>>> the master finish its job. Look for this message that tells you which
>>>>> region server's hlogs are being split:
>>>>> 
>>>>> LOG.info("Splitting " + logfiles.length + " hlog(s) in " + srcDir.toString());
>>>>> 
>>>>> Then this message will show when it's done:
>>>>> 
>>>>> LOG.info("hlog file splitting completed in " + (endMillis - millis) +
>>>>> " millis for " + srcDir.toString());
>>>>> 
>>>>> J-D
>>>>> 
>>>>> On Sun, Oct 3, 2010 at 10:56 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
>>>>>> I've recently had a region server commit suicide, and am not able to recover from it.  I've tried completely stopping the entire cluster and restarting it (including dfs and zk), but the master refuses to recognize the region servers.
>>>>>> 
>>>>>> The region servers appear to just be waiting for the master with this in their log file:
>>>>>> 
>>>>>> 2010-10-03 17:40:32,748 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
<10.249.70.255:/hbase,domU-12-31-39-18-1B-05.compute-1.internal,60020,1286127632413>Read
ZNode /hbase/master got 10.104.37.247:60000
>>>>>> 2010-10-03 17:40:32,749 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
Telling master at 10.104.37.247:60000 that we are up
>>>>>> 2010-10-03 17:40:32,862 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook:
Installed shutdown hook thread: Shutdownhook:regionserver60020
>>>>>> 
>>>>>> ... and the master log file just keeps repeating this:
>>>>>> 
>>>>>> 2010-10-03 17:42:15,531 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>>>> 2010-10-03 17:43:15,541 INFO org.apache.hadoop.hbase.master.ServerManager:
0 region servers, 0 dead, average load NaN
>>>>>> 
>>>>>> After many lines of this sort of thing:
>>>>>> 
>>>>>> 2010-10-03 17:41:05,179 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Split writer thread for region user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2.
got 11 to process
>>>>>> 2010-10-03 17:41:05,180 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Split writer thread for region user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2.
Applied 11 total edits to user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295ae
>>>>>> 
>>>>>> Followed by many lines of this:
>>>>>> 
>>>>>> 2010-10-03 17:41:24,719 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/7b49d357be708d07e6f01843a35286a7/recovered.edits/0000000000075377494
>>>>>> 2010-10-03 17:41:24,724 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/3a58b7adcf049800be83425e75288eeb/recovered.edits/0000000000075377495
>>>>>> 
>>>>>> As one might expect, attempts to access HBase hang; for example:
>>>>>> 
>>>>>> HBase Shell; enter 'help<RETURN>' for list of supported commands.
>>>>>> Type "exit<RETURN>" to leave the HBase Shell
>>>>>> Version: 0.89.20100924, r1001068, Fri Sep 24 13:55:42 PDT 2010
>>>>>> 
>>>>>> hbase(main):001:0> list
>>>>>> TABLE
>>>>>> 
>>>>>> 
>>>>>> I'm using CDH3b2 for hdfs and the version of hbase from here:  http://people.apache.org/~jdcryans/hbase-0.89.20100924-candidate-1
>>>>>> 
>>>>>> Any ideas on how I can get the master to recognize the region servers?  I'm really just concerned with how to get back up and running.
>>>>>> 
>>>>>> Thank you
>>>>>> 
>>>>>> Matthew
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
>> 

