hbase-user mailing list archives

From "louis.hust" <louis.h...@gmail.com>
Subject Re: Slow waitForAckedSeqno took too long time
Date Sat, 06 Dec 2014 04:52:16 GMT
Hi Ted,

Thanks for your reply.

Actually I shut down l-hbase1.dev.dba.cn0 at 11:47:51, and HBase did not recover until about 12:03.

The log below is the ZKFC log on l-hbase2:

{log}
2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector: Already in election.
Not re-connecting.
2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old
active which needs to be fenced...
2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode
at l-hbase1.dba.dev.cn0/10.86.36.217:8020
2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020.
Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1,
sleepTime=1000 MILLISECONDS)
2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully
make NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020 standby (unable to connect)
java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020
failed on connection exception: java.net.ConnectException: Connection timed out; For more
details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.Client.call(Client.java:1362)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
	at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
	at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
	at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
	at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
	at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
	at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.ConnectException: Connection timed out
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
	at org.apache.hadoop.ipc.Client.call(Client.java:1380)
	... 14 more
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing
Process... ======
2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer: Launched fencing command
'/bin/true' with pid 24513
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing successful by
method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb
to indicate that the local node is the most recent active...
2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode
at l-hbase2.dba.dev.cn0/10.86.36.218:8020 active...
2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned
NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 to active state
2014-12-05 12:29:56,906 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed
RPC access from hadoop (auth:SIMPLE) at 10.86.36.217
2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController: Requested by hadoop
(auth:SIMPLE) at 10.86.36.217 to cede active role.
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully ensured
local node is in standby mode
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb
of active node...

{log}
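
For context, the ShellCommandFencer(/bin/true) that the ZKFC falls back to should correspond to our dfs.ha.fencing.methods setting. Roughly (a sketch, not our exact hdfs-site.xml), it looks like this:

{code}
<!-- hdfs-site.xml (sketch): fencing method the ZKFC runs after it fails to
     gracefully transition the old active NameNode to standby -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- /bin/true always exits 0, so no real fencing of the old active is done -->
  <value>shell(/bin/true)</value>
</property>
{code}

Since /bin/true always exits 0, the fencing step reports success immediately, which matches the "Fencing successful" line above even though the old active was never reachable.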


The namenode log is a bit large, so I have attached it.
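
For what it's worth, the threshold=30000ms in the "Slow waitForAckedSeqno" warning quoted below looks like the HDFS client's slow-IO warning threshold. If I read it right, it is controlled by the property sketched below (default value shown); note it only decides when the client logs the warning, not how long recovery actually takes:

{code}
<!-- hdfs-site.xml (sketch): DFS client threshold for logging
     "Slow waitForAckedSeqno"; 30000 ms is believed to be the default,
     so changing it only affects logging, not the recovery time itself -->
<property>
  <name>dfs.client.slow.io.warning.threshold.ms</name>
  <value>30000</value>
</property>
{code}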


On Dec 5, 2014, at 23:01, Ted Yu <yuzhihong@gmail.com> wrote:

> The warning was logged by DFSOutputStream.
> 
> What was the load on hdfs around 2014-12-05 12:03 ?
> Have you checked namenode log ?
> 
> Cheers
> 
> On Thu, Dec 4, 2014 at 9:01 PM, mail list <louis.hust.ml@gmail.com> wrote:
> 
>> Hi, all
>> 
>> I deployed HBase 0.98.6-cdh5.2.0 on 3 machines:
>> 
>> l-hbase1.dev.dba.cn0 (hadoop namenode active, HMaster active)
>> l-hbase2.dev.dba.cn0 (hadoop namenode standby, HMaster standby, hadoop datanode)
>> l-hbase3.dev.dba.cn0 (regionserver, hadoop datanode)
>> 
>> Then I shut down l-hbase1.dev.dba.cn0, but HBase could not work until
>> about 15 minutes later.
>> I checked the logs and found the following in the region server's log:
>> 
>> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0]
>> hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
>> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0]
>> wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [
>> 10.86.36.219:50010]
>> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller]
>> regionserver.LogRoller: HLog roll requested
>> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1]
>> hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
>> 
>> It seems the WAL async sync took too long, which delayed the region server
>> recovery? I don't know if this log matters.
>> Can anybody explain the reason, and how to reduce the recovery time?
>> 
>> 
>> 

