hbase-user mailing list archives

From Daniel Einspanjer <deinspan...@mozilla.com>
Subject Re: Region servers up and running, but Master reports 0
Date Tue, 24 Aug 2010 03:30:25 GMT
Matthew,

Maybe instead of changing the replication factor, you could spin up the new nodes with a different
datacenter/rack configuration, which would cause Hadoop to ensure that the replicas are not stored
solely on those temporary nodes?

Matthew LeMieux <mdl@mlogiciels.com> wrote:

J-D, 

     Thank you for the very fast response.  I'm running in the Amazon cloud and would like
to be able to expand and shrink the cluster quickly to support Hadoop jobs, so I've been
playing with the replication factor and the number of machines.

      If I add 5 more machines for a job running overnight that needs some extra power,
then removing those machines the next morning becomes a tedious task, because some of the data
generated may reside exclusively on the 5 new machines.  Since the HDFS node decommission
process takes so long, I've been setting the replication factor high and removing machines
incrementally.  For the most part it works great, because replication is much faster than
decommissioning.  Hadoop M/R jobs and HDFS seem completely content to deal with nodes coming
and going as long as there is at least one replica from which to replicate.
Since the Hadoop M/R jobs pull data from HBase, do some processing, and then put the data
back into HBase, I assumed that hosting the Hadoop M/R jobs and HBase on the same HDFS would be
faster than copying the data to a new transient cluster.
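For what it's worth, that dance can be driven with the stock Hadoop tools; a rough sketch
(the replication factor of 4 is just an example, and dfs.hosts.exclude must already point
at an exclude file):

       # raise replication on the HBase tree and wait for it to take effect
       hadoop fs -setrep -R -w 4 /hbase

       # list the temporary nodes in the dfs.hosts.exclude file, then ask
       # the namenode to start decommissioning them
       hadoop dfsadmin -refreshNodes

       # watch for "Decommission in progress" turning into "Decommissioned"
       hadoop dfsadmin -report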

     Is there a way to tell HBase that I want the replication factor to be high, but not to
panic (i.e., go crazy with logs) if it isn't actually at that number?  Sort of like a soft
limit and a hard limit...

Thank you,

Matthew

P.S.  I do need the data; I wish there were a way to speed this up.  I didn't find any evidence
of a crash, just a looooong recovery.


On Aug 23, 2010, at 4:30 PM, Jean-Daniel Cryans wrote:

> Matthew,
> 
> Yes, the master won't start until the log splitting is done. By the
> looks of it (such a high number of logs, when the max is 32 before the
> region server force-flushes hlogs), it seems that your region servers
> in a previous run weren't able to ensure that each HLog had 3 replicas,
> even though dfs.replication was set to that number. Check your previous
> region server logs; you should see instances of:
> 
>        LOG.warn("HDFS pipeline error detected. " +
>            "Found " + numCurrentReplicas + " replicas but expecting " +
>            this.initialReplication + " replicas. " +
>            " Requesting close of hlog.");
> 
> And that probably made 1 log per insert. If you're still testing and
> your data isn't important, feel free to kill -9 the master and nuke
> the HBase HDFS folder or at least the .logs folder. If not, you'll
> have to wait. Once it's done, make sure you have a proper
> configuration before any other testing.
> 
> J-D
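In concrete terms, the "nuke the .logs folder" option above would boil down to something
like the following, assuming the /hbase root directory seen in the logs below, and only
when the data really is disposable:

       # with HBase stopped, remove the write-ahead logs the master is splitting
       hadoop fs -rmr /hbase/.logs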
> 
> On Mon, Aug 23, 2010 at 4:22 PM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
>> I have a cluster of 3 machines, running the Cloudera distribution (CDH3), where the NameNode
is separate from the HMaster.   I have been running it successfully for a couple of weeks.   As
of this morning, it is completely unusable.  I'm looking for some help on how to fix it.
Details are below.  Thank you.
>> 
>> This morning I found HBase to be unresponsive and tried restarting it.  That didn't
help.  For example, running "hbase shell" and then "list" hangs.
>> 
>> The master and region processes start up, but the master does not recognize that
the region servers are there.  I am getting the following in master's log file:
>> 
>> 2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>> 
>> 
>> Meanwhile, the region servers show this in their log files:
>> 
>> 2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zookeeper:2181
>> 2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zookeeper:2181, initiating session
>> 2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zookeeper:2181, sessionid = 0x12aa0cc2520000e, negotiated timeout = 40000
>> 2010-08-23 23:05:21,172 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: None, path: null
>> 2010-08-23 23:05:21,177 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
>> 2010-08-23 23:05:21,421 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got master:60000
>> 2010-08-23 23:05:21,421 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at master:60000 that we are up
>> 2010-08-23 23:05:22,056 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
>> 
>> The region server process is obviously waiting on something:
>> 
>> /tmp/hbaselog$ sudo strace -p7592
>> Process 7592 attached - interrupt to quit
>> futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL
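As an aside, for a JVM a Java-level thread dump is usually more telling than strace; assuming
a JDK is on the box, either of the following would show what the region server threads are
actually blocked on:

       jstack 7592        # prints a thread dump to stdout
       kill -QUIT 7592    # same dump, typically written to the region server's .out file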
>> 
>> The Master isn't idle; it appears to be trying to do some sort of recovery, having
woken up to find 0 region servers.  Here is an excerpt from its log:
>> 
>> 2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12142 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435, length=1150
>> 2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
>> 2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
>> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=3 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
>> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12143 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451, length=448
>> 2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>> 2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=2 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12144 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468, length=582
>> 2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468
>> 
>> 
>> It looks like the Master is sequentially going through the logs up to 143261, having
started at 1; it is currently at 12144.   At the current rate, it will take around 12 hours
to complete.  Do I have to wait for it to finish before the master will recognize the region
servers?  And if it doesn't have any region servers, then what the heck is the master doing anyway?
>> 
>> Thank you for your help,
>> 
>> Matthew
>> 
>> 
>> 

