hbase-user mailing list archives

From Esteban Gutierrez <este...@cloudera.com>
Subject Re: Slow waitForAckedSeqno took too long time
Date Tue, 09 Dec 2014 01:25:08 GMT
Hi Louis,

That makes sense too.

Thanks for keeping us posted!



--
Cloudera, Inc.


On Mon, Dec 8, 2014 at 5:14 PM, louis.hust <louis.hust@gmail.com> wrote:

> Hi Esteban,
>
> We use VMs for the test, and we just wanted to test NameNode HA
> failover, which is why the cluster is so small. ^_^
>
> And we found where the problem is, see below:
> Begin forwarded message:
>
> *From: *"louis.hust" <louis.hust@gmail.com>
> *Subject: **Re: Slow waitForAckedSeqno took too long time*
> *Date: *December 8, 2014 at 17:56:41 GMT+8
> *To: *Ted Yu <yuzhihong@gmail.com>
> *Cc: *hdfs-dev@hadoop.apache.org
>
> Hi, Ted
>
> It seems we found the problem.
>
> Today we tried again, shutting down the NameNode directly instead of
> using iptables to simulate the failure.
> And HBase recovered immediately. So I think the problem lies in our
> experiment method (shutdown vs. iptables),
> although we have no idea how HDFS handles the two different failures.
>
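[Editor's note: the shutdown-vs-iptables difference Louis describes comes down to TCP failure modes. When the process is dead but the host is up, a connect attempt is answered immediately with a RST, so the client fails fast; when an iptables rule silently drops packets (rather than rejecting them), no answer ever comes back and each attempt blocks until a timeout fires. A minimal Python sketch of the fast case, with the slow case noted in comments:]

```python
import socket
import time

def try_connect(host, port, timeout=5.0):
    """Attempt a TCP connect and report the failure mode and elapsed time."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected", time.monotonic() - start
    except ConnectionRefusedError:
        # Process down, host up: the kernel answers with RST, so the
        # caller fails in milliseconds (the "clean shutdown" case).
        return "refused", time.monotonic() - start
    except (socket.timeout, OSError):
        # Packets silently dropped (iptables DROP): no RST ever arrives,
        # so the caller blocks until its own timeout expires.
        return "timeout", time.monotonic() - start

# Port 1 on localhost almost certainly has no listener, so this
# demonstrates the fast "refused" path:
result, elapsed = try_connect("127.0.0.1", 1)
```

In the iptables case, the same call would sit in the `timeout` branch for the full timeout on every retry, which is why the cluster-level recovery stretched to minutes.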
> On Dec 8, 2014, at 17:00, Ted Yu <yuzhihong@gmail.com> wrote:
>
> 0.98.6 should run on top of Hadoop 2.3
>
> You can confirm on the Cloudera mailing list.
>
> Cheers
>
> On Dec 7, 2014, at 10:29 PM, "louis.hust" <louis.hust@gmail.com> wrote:
>
> Hi Ted,
>
> We use HBase 0.98.6-cdh5.0.2 but Hadoop 2.3.0-cdh5.0.1, because we
> upgraded only HBase.
> Could there be a compatibility problem between HBase and Hadoop?
>
>
> On Dec 8, 2014, at 9:12, louis.hust <louis.hust@gmail.com> wrote:
>
> Thank you very much!
>
> On Dec 8, 2014, at 0:38, Ted Yu <yuzhihong@gmail.com> wrote:
>
> Louis:
> I am still studying the log you sent.
>
> An hdfs developer may get to the root cause faster than I do.
>
> Cheers
>
> On Sun, Dec 7, 2014 at 5:02 AM, louis.hust <louis.hust@gmail.com> wrote:
>
>> Hi, Ted
>>
>> How did you find the problem at the NameNode? Is there some significant
>> entry in the NameNode log?
>>
>> On Dec 7, 2014, at 10:24, louis.hust <louis.hust@gmail.com> wrote:
>>
>> BTW, I use HBase 0.98.6-cdh5.0.2 but Hadoop 2.3.0-cdh5.0.1, because we
>> upgraded only HBase.
>> On Dec 6, 2014, at 23:23, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> Forwarding to hdfs-dev@ since the logs were pertaining to namenode.
>>
>> Cheers
>>
>> ---------- Forwarded message ----------
>> From: louis.hust <louis.hust@gmail.com>
>> Date: Fri, Dec 5, 2014 at 8:52 PM
>> Subject: Re: Slow waitForAckedSeqno took too long time
>> To: user@hbase.apache.org
>>
>>
>> Hi Ted,
>>
>> Thanks for your reply.
>>
>> Actually I shut down l-hbase1.dev.dba.cn0 at 11:47:51 and
>> HBase recovered at 12:03.
>>
>> The log below is the ZKFC log on l-hbase2:
>>
>> {log}
>> 2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Already in election. Not re-connecting.
>> 2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Checking for any old active which needs to be fenced...
>> 2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Old node exists:
>> 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
>> 2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
>> 2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying
>> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried
>> 0 time(s); retry policy is
>> RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000
>> MILLISECONDS)
>> 2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController:
>> Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/
>> 10.86.36.217:8020 standby (unable to connect)
>> java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.
>> qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection
>> exception: java.net.ConnectException: Connection timed out; For more
>> details see:  http://wiki.apache.org/hadoop/ConnectionRefused
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1413)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>> at
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>> at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
>> at
>> org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>> at
>> org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>> at
>> org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
>> at
>> org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>> at
>> org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>> at
>> org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>> at
>> org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
>> at
>> org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
>> at
>> org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>> at
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>> Caused by: java.net.ConnectException: Connection timed out
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
>> at
>> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>> at
>> org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
>> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
>> at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
>> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
>> at org.apache.hadoop.ipc.Client.call(Client.java:1380)
>> ... 14 more
>> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ======
>> Beginning Service Fencing Process... ======
>> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying
>> method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
>> 2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer:
>> Launched fencing command '/bin/true' with pid 24513
>> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ======
>> Fencing successful by method
>> org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
>> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the
>> local node is the most recent active...
>> 2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020
>> active...
>> 2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/
>> 10.86.36.218:8020 to active state
>> 2014-12-05 12:29:56,906 INFO
>> org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access
>> from hadoop (auth:SIMPLE) at 10.86.36.217
>> 2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
>> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController:
>> Successfully ensured local node is in standby mode
>> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Yielding from election
>> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector:
>> Deleting bread-crumb of active node…
>>
>> {log}
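[Editor's note: the `ShellCommandFencer(/bin/true)` seen in the log corresponds to a fencing setup roughly like the hdfs-site.xml fragment below. This is a sketch, not the poster's actual configuration:]

```xml
<!-- Hypothetical hdfs-site.xml fragment matching the fencer in the log.
     /bin/true always exits 0, so fencing "succeeds" unconditionally;
     that is only safe when the old active NameNode is known to be dead
     (e.g. the whole VM was shut down), since nothing is actually fenced. -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
```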
>>
>>
>> The NameNode log is a bit large, so I attached it.
>>
>>
>> On Dec 5, 2014, at 23:01, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> The warning was logged by DFSOutputStream.
>>
>> What was the load on hdfs around 2014-12-05 12:03 ?
>> Have you checked namenode log ?
>>
>> Cheers
>>
>> On Thu, Dec 4, 2014 at 9:01 PM, mail list <louis.hust.ml@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> I deployed HBase 0.98.6-cdh5.2.0 on 3 machines:
>>
>> l-hbase1.dev.dba.cn0 (hadoop namenode active, HMaster active)
>> l-hbase2.dev.dba.cn0 (hadoop namenode standby, HMaster standby, hadoop
>> datanode)
>> l-hbase3.dev.dba.cn0 (regionserver, hadoop datanode)
>>
>> Then I shut down l-hbase1.dev.dba.cn0, but HBase could not work until
>> about 15 minutes later.
>> I checked the logs and found the following in the region server's log:
>>
>> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0]
>> hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
>> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0]
>> wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [
>> 10.86.36.219:50010]
>> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller]
>> regionserver.LogRoller: HLog roll requested
>> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1]
>> hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
>>
>> It seems the async WAL sync took too long for the region server to
>> recover, but I don't know whether these log lines matter.
>> Can anybody explain the reason, and how to reduce the recovery time?
>>
>> <l-hbase2.namenode.log>
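[Editor's note: on the question of reducing recovery time, the long stall in the packet-drop scenario is dominated by connect timeouts and retries inside the Hadoop IPC client. The property names below are real Hadoop settings, but the values are only illustrative, not a tested recommendation:]

```xml
<!-- core-site.xml sketch; values are illustrative only. -->
<property>
  <name>ipc.client.connect.timeout</name>
  <value>5000</value>  <!-- ms per connect attempt (default 20000) -->
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>3</value>     <!-- connect retries on timeout (default 45) -->
</property>
```

With the defaults, a client facing silently dropped packets can spend many minutes retrying before it gives up, which is consistent with the ~15-minute recovery reported above.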
