hbase-user mailing list archives

From Matthew LeMieux <...@mlogiciels.com>
Subject Re: Region servers up and running, but Master reports 0
Date Tue, 24 Aug 2010 14:25:57 GMT
That is a very good point.  I had only been considering the data locality features of rack
configuration, not the fault tolerance aspect.  I'll give that a try.  Thank you. 

-Matthew

On Aug 23, 2010, at 8:30 PM, Daniel Einspanjer wrote:

> Matthew,
> 
> Maybe instead of changing the replication factor, you could spin up the new nodes with a
> different datacenter/rack configuration, which would cause Hadoop to ensure the replicas
> are not solely on those temp nodes?
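> 
> Concretely, that's just a matter of pointing the namenode at a topology script before the
> temp nodes join. A minimal sketch, assuming the 0.20-era property name (the script path
> and rack names below are made up):
> 
>   <!-- core-site.xml on the namenode -->
>   <property>
>     <name>topology.script.file.name</name>
>     <value>/etc/hadoop/conf/topology.sh</value>
>   </property>
> 
> The script maps each datanode address to a rack path, e.g. the temp nodes to
> /dc1/rack-temp and the permanent ones to /dc1/rack-perm; the default block placement
> policy then keeps at least one replica off the temp rack, so you can drop those nodes
> without losing the only copy.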
> 
> Matthew LeMieux <mdl@mlogiciels.com> wrote:
> 
> J-D, 
> 
> Thank you for the very fast response. I'm running in the Amazon cloud and would like to be
> able to grow and shrink the cluster quickly to support Hadoop jobs, so I've been playing
> with the replication factor and the number of machines.
> 
> If I add 5 more machines for a job I'm running overnight that needs some extra power,
> removing those machines the next morning becomes a tedious task, because some of the data
> generated may reside exclusively on the 5 new machines. Since the HDFS node decommission
> process takes so long, I've been setting the replication factor high and removing machines
> incrementally. For the most part it works great, because replication is much faster than
> decommissioning. Hadoop M/R jobs and HDFS seem perfectly content to deal with nodes coming
> and going as long as there is at least one replica left to re-replicate from. Since the
> Hadoop M/R jobs pull data from HBase, do some processing, and then put the data back into
> HBase, I assumed that hosting Hadoop M/R and HBase on the same HDFS would be faster than
> copying the data to a new transient cluster.
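> 
> For reference, the bump itself is just "hadoop fs -setrep -R 5 /hbase". Scripted through
> the FileSystem API it would look roughly like this (a sketch; the namenode URI and the
> factor of 5 are placeholders, and it only touches existing files, since new files are
> still created at dfs.replication):
> 
>   import java.net.URI;
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
> 
>   public class SetRep {
>     // Walk the tree and raise the target replica count on every file;
>     // directories don't carry a replication factor of their own.
>     static void setRepRecursive(FileSystem fs, Path p, short rep) throws Exception {
>       for (FileStatus stat : fs.listStatus(p)) {
>         if (stat.isDir()) {
>           setRepRecursive(fs, stat.getPath(), rep);
>         } else {
>           fs.setReplication(stat.getPath(), rep);
>         }
>       }
>     }
> 
>     public static void main(String[] args) throws Exception {
>       FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"),
>           new Configuration());
>       setRepRecursive(fs, new Path("/hbase"), (short) 5);
>     }
>   }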
> 
> Is there a way to tell HBase that I want the replication factor to be high, but not to
> panic (i.e., go crazy with logs) if it isn't actually at that number? Sort of like a soft
> limit and a hard limit...
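> 
> (HDFS itself has something close to that split; a sketch, assuming the 0.20-era property
> names, worth double-checking on CDH3:
> 
>   <property>
>     <name>dfs.replication</name>
>     <value>5</value>         <!-- the "soft" target for new files -->
>   </property>
>   <property>
>     <name>dfs.replication.min</name>
>     <value>2</value>         <!-- writes fail below this "hard" floor -->
>   </property>
> 
> HBase's pipeline check, though, going by the warning J-D quotes below, compares the live
> count against initialReplication, whatever the hlog was created with, so it still
> complains whenever the pipeline drops under the full factor.)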
> 
> Thank you,
> 
> Matthew
> 
> P.S.  I do need the data; I wish there were a way to speed this up. I didn't find any
> evidence of a crash, just a looooong recovery.
> 
> 
> On Aug 23, 2010, at 4:30 PM, Jean-Daniel Cryans wrote:
> 
>> Matthew,
>> 
>> Yes, the master won't start until the log splitting is done. By the
>> looks of it (such a high number of logs, when the max is 32 before the
>> region server force-flushes hlogs), it seems that your region servers
>> in a previous run weren't able to ensure that each HLog had 3 replicas,
>> even though dfs.replication was set to that number. Check your previous
>> region server logs; you should see instances of:
>> 
>>       LOG.warn("HDFS pipeline error detected. " +
>>           "Found " + numCurrentReplicas + " replicas but expecting " +
>>           this.initialReplication + " replicas. " +
>>           " Requesting close of hlog.");
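>> 
>> For context, the replica count there comes from a reflection hook on the hlog's output
>> stream, since getNumCurrentReplicas() isn't in the public FileSystem API. Roughly this,
>> on an append-capable 0.20 HDFS (a sketch; the variable names are made up):
>> 
>>       // hlogOut is the FSDataOutputStream the hlog writer holds open.
>>       OutputStream wrapped = hlogOut.getWrappedStream();
>>       Method m = wrapped.getClass().getMethod("getNumCurrentReplicas");
>>       int numCurrentReplicas = ((Integer) m.invoke(wrapped)).intValue();
>>       if (numCurrentReplicas < initialReplication) {
>>         // request a log roll: the warning above plus a brand new hlog file
>>       }
>> 
>> So every batch of writes that sees a short pipeline can force a roll, which is how you
>> can end up with 143k tiny hlogs.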
>> 
>> And that probably made 1 log per insert. If you're still testing and
>> your data isn't important, feel free to kill -9 the master and nuke
>> the HBase HDFS folder, or at least the .logs folder. If not, you'll
>> have to wait. Once it's done, make sure you have a proper
>> configuration before any further testing.
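>> 
>> That is, assuming the default root, something like:
>> 
>>       hadoop fs -rmr /hbase/.logs     # drop just the pending hlogs
>>       hadoop fs -rmr /hbase           # or wipe everything HBase keeps in HDFS
>> 
>> (adjust the paths to your hbase.rootdir, and only if the data really is disposable).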
>> 
>> J-D
>> 
>> On Mon, Aug 23, 2010 at 4:22 PM, Matthew LeMieux <mdl@mlogiciels.com> wrote:
>>> I have a cluster of 3 machines where the NameNode is separate from the HMaster,
>>> based on the distribution from Cloudera (CDH3). I have been running it successfully
>>> for a couple of weeks. As of this morning, it is completely unusable. I'm looking for
>>> some help on how to fix it. Details are below. Thank you.
>>> 
>>> This morning I found HBase to be unresponsive and tried restarting it. That didn't
>>> help. For example, running "hbase shell" and then "list" hangs.
>>> 
>>> The master and region server processes start up, but the master does not recognize
>>> that the region servers are there. I am getting the following in the master's log file:
>>> 
>>> 2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
>>> 
>>> 
>>> Meanwhile, the region servers show this in their log files:
>>> 
>>> 2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zookeeper:2181
>>> 2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zookeeper:2181, initiating session
>>> 2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zookeeper:2181, sessionid = 0x12aa0cc2520000e, negotiated timeout = 40000
>>> 2010-08-23 23:05:21,172 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: None, path: null
>>> 2010-08-23 23:05:21,177 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
>>> 2010-08-23 23:05:21,421 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got master:60000
>>> 2010-08-23 23:05:21,421 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at master:60000 that we are up
>>> 2010-08-23 23:05:22,056 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
>>> 
>>> The Region server process is obviously waiting on something:
>>> 
>>> /tmp/hbaselog$ sudo strace -p7592
>>> Process 7592 attached - interrupt to quit
>>> futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL
>>> 
>>> The Master isn't idle; it appears to be trying to do some sort of recovery, having
>>> woken up to find 0 region servers. Here is an excerpt from its log:
>>> 
>>> 2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12142 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435, length=1150
>>> 2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
>>> 2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
>>> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=3 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
>>> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12143 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451, length=448
>>> 2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>>> 2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>>> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=2 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>>> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12144 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468, length=582
>>> 2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468
>>> 
>>> 
>>> It looks like the Master is going through the logs sequentially, having started at 1
>>> and currently sitting at 12144 of 143261. At the current rate, it will take around 12
>>> hours to complete. Do I have to wait for it to finish before the master will recognize
>>> the region servers? And if it doesn't have any region servers, then what the heck is
>>> the master doing anyway?
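>>> 
>>> (Rough math: hlogs 12142 through 12144 were split between 23:10:06,290 and
>>> 23:10:06,723 above, call it 0.2 to 0.3 seconds apiece; 143,261 logs at ~0.25 s each is
>>> about 36,000 seconds, so on the order of 10 to 12 hours end to end.)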
>>> 
>>> Thank you for your help,
>>> 
>>> Matthew
>>> 
>>> 
>>> 
> 

