hbase-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: Slow waitForAckedSeqno took too long time
Date Mon, 08 Dec 2014 19:02:09 GMT
Hello Louis,

Just looking at the timeouts and the services you are running in this
cluster, I think the issue could be caused by memory oversubscription or
some sort of network issue that triggers a failover in the NN, making
Region Servers hang until HDFS is available again. Make sure that the heap
sizes of all the processes on a node don't add up to more than its
physical memory, and rule out any other network issue. Also, since your
cluster is very small, you should probably consider not using HA for HDFS
at this point: it's only consuming additional resources and you won't get
any real advantage in a 3-node cluster. Are you by any chance using VMs,
or are those 3 nodes physical boxes?

cheers,
esteban.


--
Cloudera, Inc.


On Fri, Dec 5, 2014 at 8:52 PM, louis.hust <louis.hust@gmail.com> wrote:

> Hi Ted,
>
> Thanks for your reply.
>
> Actually I shut down l-hbase1.dev.dba.cn0 at 11:47:51 and HBase
> recovered at about 12:03.
>
> The log below is the ZKFC log on l-hbase2:
>
> {log}
> 2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Already in election. Not re-connecting.
> 2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Checking for any old active which needs to be fenced...
> 2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Old node exists:
> 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
> 2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController:
> Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
> 2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
> 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
> MILLISECONDS)
> 2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController:
> Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/
> 10.86.36.217:8020 standby (unable to connect)
> java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.
> qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection
> exception: java.net.ConnectException: Connection timed out; For more
> details see:  http://wiki.apache.org/hadoop/ConnectionRefused
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> at org.apache.hadoop.ipc.Client.call(Client.java:1362)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
> at
> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
> at
> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
> at
> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
> at
> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
> at
> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
> at
> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
> at
> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
> at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.net.ConnectException: Connection timed out
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
> at
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
> at org.apache.hadoop.ipc.Client.call(Client.java:1380)
> ... 14 more
> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ======
> Beginning Service Fencing Process... ======
> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying
> method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
> 2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer:
> Launched fencing command '/bin/true' with pid 24513
> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ======
> Fencing successful by method
> org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the
> local node is the most recent active...
> 2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController:
> Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020
> active...
> 2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController:
> Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/
> 10.86.36.218:8020 to active state
> 2014-12-05 12:29:56,906 INFO
> org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access
> from hadoop (auth:SIMPLE) at 10.86.36.217
> 2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController:
> Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController:
> Successfully ensured local node is in standby mode
> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Yielding from election
> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
> Deleting bread-crumb of active node…
>
> {log}
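>
> For reference, the NodeFencer lines above should correspond to a fencing
> setup along these lines in our hdfs-site.xml (a sketch reconstructed
> from the log; shell(/bin/true) always succeeds, so it is effectively a
> no-op fence):
>
> <property>
>   <name>dfs.ha.fencing.methods</name>
>   <value>shell(/bin/true)</value>
> </property>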
>
>
> The namenode log is a bit large, so I have attached it.
>
>
> On Dec 5, 2014, at 23:01, Ted Yu <yuzhihong@gmail.com> wrote:
>
> The warning was logged by DFSOutputStream.
>
> What was the load on HDFS around 2014-12-05 12:03?
> Have you checked the namenode log?
>
> Cheers
>
> On Thu, Dec 4, 2014 at 9:01 PM, mail list <louis.hust.ml@gmail.com> wrote:
>
> Hi ,all
>
> I deployed HBase 0.98.6-cdh5.2.0 on 3 machines:
>
> l-hbase1.dev.dba.cn0 (hadoop namenode active, HMaster active)
> l-hbase2.dev.dba.cn0 (hadoop namenode standby, HMaster standby, hadoop datanode)
> l-hbase3.dev.dba.cn0 (regionserver, hadoop datanode)
>
> Then I shut down l-hbase1.dev.dba.cn0, but HBase could not work until
> about 15 minutes later.
> I checked the logs and found the following in the region server's log:
>
> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0]
> hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0]
> wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [
> 10.86.36.219:50010]
> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller]
> regionserver.LogRoller: HLog roll requested
> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1]
> hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
>
> It seems the WAL async sync took too long for the region server to
> recover? I don't know if these log lines matter.
> Can anybody explain the reason, and how to reduce the recovery time?
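>
> One more detail I noticed: the threshold=30000ms in the warning appears
> to come from the HDFS client setting
> dfs.client.slow.io.warning.threshold.ms; if I read it right, this only
> controls when the warning is logged, not the recovery time itself. A
> sketch of the default, assuming that mapping is correct:
>
> <property>
>   <name>dfs.client.slow.io.warning.threshold.ms</name>
>   <value>30000</value>
> </property>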
