hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Master does not recognize region servers - cannot restart cluster
Date Sun, 03 Oct 2010 19:02:27 GMT
From a brief look, it seems that the master is splitting the
log files from the dead region server. While the cluster is running,
it does that and keeps answering the other region servers, but if you
restart HBase, the master splits everything at startup before it
begins accepting region server check-ins. Just let the master finish
its job. Look for this message, which tells you which region server's
hlogs are being split:

LOG.info("Splitting " + logfiles.length + " hlog(s) in " + srcDir.toString());

Then this message will show when it's done:

LOG.info("hlog file splitting completed in " + (endMillis - millis) +
" millis for " + srcDir.toString());

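If you would rather not eyeball the log, here is a rough sketch of a scanner that picks those two messages out of a master log file. The regular expressions are derived only from the two LOG.info calls quoted above; the class name, the method name, and the idea of passing the log path as the first argument are my own assumptions, so adjust them to your install.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase: reports hlog split start/finish lines.
public class SplitLogScan {
    // Matches "Splitting <n> hlog(s) in <srcDir>"
    static final Pattern START =
        Pattern.compile("Splitting (\\d+) hlog\\(s\\) in (\\S+)");
    // Matches "hlog file splitting completed in <n> millis for <srcDir>"
    static final Pattern DONE =
        Pattern.compile("hlog file splitting completed in (\\d+) millis for (\\S+)");

    /** Returns a short summary for a split message, or null for any other line. */
    static String describe(String line) {
        Matcher m = START.matcher(line);
        if (m.find()) {
            return "started: " + m.group(1) + " hlog(s) in " + m.group(2);
        }
        m = DONE.matcher(line);
        if (m.find()) {
            return "finished: " + m.group(2) + " after " + m.group(1) + " ms";
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to the master log file.
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            String summary = describe(line);
            if (summary != null) {
                System.out.println(summary);
            }
        }
    }
}
```

A directory that shows up as "started" without a matching "finished" line is presumably still being split, in which case the master will not be taking region server check-ins yet.
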
J-D

On Sun, Oct 3, 2010 at 10:56 AM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
> I've recently had a region server suicide, and am not able to recover from it.  I've tried completely stopping the entire cluster and restarting it (including dfs and zk), but the master refuses to recognize the regionservers.
>
> The region servers appear to just be waiting for the master with this in their log file:
>
> 2010-10-03 17:40:32,748 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <10.249.70.255:/hbase,domU-12-31-39-18-1B-05.compute-1.internal,60020,1286127632413>Read ZNode /hbase/master got 10.104.37.247:60000
> 2010-10-03 17:40:32,749 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
> 2010-10-03 17:40:32,862 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
>
> ... and the master log file just keeps repeating this:
>
> 2010-10-03 17:42:15,531 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
> 2010-10-03 17:43:15,541 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>
> After many lines of this sort of thing:
>
> 2010-10-03 17:41:05,179 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Split writer thread for region user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. got 11 to process
> 2010-10-03 17:41:05,180 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Split writer thread for region user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. Applied 11 total edits to user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295ae
>
> Followed by many lines of this:
>
> 2010-10-03 17:41:24,719 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/7b49d357be708d07e6f01843a35286a7/recovered.edits/0000000000075377494
> 2010-10-03 17:41:24,724 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/3a58b7adcf049800be83425e75288eeb/recovered.edits/0000000000075377495
>
> As one might expect, attempts to access hbase hang, for example:
>
> HBase Shell; enter 'help<RETURN>' for list of supported commands.
> Type "exit<RETURN>" to leave the HBase Shell
> Version: 0.89.20100924, r1001068, Fri Sep 24 13:55:42 PDT 2010
>
> hbase(main):001:0> list
> TABLE
>
>
> I'm using CDH3b2 for hdfs and the version of hbase from here:  http://people.apache.org/~jdcryans/hbase-0.89.20100924-candidate-1
>
> Any ideas on how I can get the master to recognize the region servers?  I'm really just concerned with how to get back up and running.
>
> Thank you
>
> Matthew
>
>
