hbase-user mailing list archives

From "louis.hust" <louis.h...@gmail.com>
Subject Re: Slow waitForAckedSeqno took too long time
Date Tue, 09 Dec 2014 02:13:07 GMT
Hi Esteban,

Do you have any idea about the difference between shutting the node down and blocking it with iptables?
Why is the behaviour so different?
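
For context on that difference: if "shutdown" means the NameNode process stops (or the VM goes away in a way the network can detect), clients usually get a fast, hard error such as "Connection refused" or "No route to host"; with an iptables DROP rule the packets simply vanish, so the only signal is a slow "Connection timed out". The sketch below is not from this thread; it only shows how the two failure modes surface to a plain Java client, with host, port and timeout as placeholders loosely based on the logs.

{code}
// Minimal sketch (not from this thread) of how the two failure modes look to
// a plain Java client; host, port and timeout are placeholders from the logs.
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ConnectProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "l-hbase1.dba.dev.cn0"; // placeholder
        int port = 8020;        // NameNode RPC port seen in the logs
        int timeoutMs = 20000;  // arbitrary probe timeout

        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            System.out.println("connected");
        } catch (ConnectException e) {
            // Process stopped but host still up: the kernel answers with RST,
            // so the failure is immediate.
            System.out.println("refused after " + elapsedMs(start) + " ms");
        } catch (NoRouteToHostException e) {
            // Host gone and the network can tell: also a fast, hard failure.
            System.out.println("no route after " + elapsedMs(start) + " ms");
        } catch (SocketTimeoutException e) {
            // iptables DROP: packets vanish, the caller waits out the full timeout.
            System.out.println("timed out after " + elapsedMs(start) + " ms");
        }
    }

    private static long elapsedMs(long startNanos) {
        return (System.nanoTime() - startNanos) / 1000000L;
    }
}
{code}

That pattern matches the zkfc log further down in this thread, where the graceful-fence RPC to l-hbase1 fails with "Connection timed out" rather than "Connection refused", so the failover controller and the clients spend their time waiting out retries and timeouts.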

On Dec 9, 2014, at 9:25, Esteban Gutierrez <esteban@cloudera.com> wrote:

> 
> Hi Louis,
> 
> That makes sense too.
>  
> Thanks for keeping us posted!
> 
> 
> 
> --
> Cloudera, Inc.
> 
> 
> On Mon, Dec 8, 2014 at 5:14 PM, louis.hust <louis.hust@gmail.com> wrote:
> Hi Esteban,
> 
> We use VMs for the test, and we just want to test NameNode HA failover, so the cluster is small. ^_^
> 
> And we found where the problem is; see below:
> Begin forwarded message:
> 
>> From: "louis.hust" <louis.hust@gmail.com>
>> Subject: Re: Slow waitForAckedSeqno took too long time
>> Date: December 8, 2014 at 17:56:41 GMT+8
>> To: Ted Yu <yuzhihong@gmail.com>
>> Cc: hdfs-dev@hadoop.apache.org
>> 
>> Hi, Ted
>> 
>> It seems we found the problem.
>> 
>> Today we tried again, shutting down the name node directly instead of using iptables to simulate the shutdown,
>> and HBase worked immediately. So I think there was something wrong with our experiment method (shutdown vs iptables),
>> although we have no idea how HDFS handles the two different exceptions.
>> 
>> On Dec 8, 2014, at 17:00, Ted Yu <yuzhihong@gmail.com> wrote:
>> 
>>> 0.98.6 should run on top of Hadoop 2.3
>>> 
>>> You can confirm on the Cloudera mailing list. 
>>> 
>>> Cheers
>>> 
>>> On Dec 7, 2014, at 10:29 PM, "louis.hust" <louis.hust@gmail.com> wrote:
>>> 
>>>> Hi Ted,
>>>> 
>>>> We use HBase 0.98.6-cdh5.0.2, but Hadoop is 2.3.0-cdh5.0.1 because we upgraded HBase.
>>>> Could there be a compatibility issue between HBase and Hadoop?
>>>> 
>>>> 
>>>> On Dec 8, 2014, at 9:12, louis.hust <louis.hust@gmail.com> wrote:
>>>> 
>>>>> Thank you very much!
>>>>> 
>>>>> On Dec 8, 2014, at 0:38, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>> 
>>>>>> Louis:
>>>>>> I am still studying the log you sent.
>>>>>> 
>>>>>> An hdfs developer may get to the root cause faster than I do.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> On Sun, Dec 7, 2014 at 5:02 AM, louis.hust <louis.hust@gmail.com> wrote:
>>>>>> Hi, Ted
>>>>>> 
>>>>>> How did you find the problem at the name node? Is there a significant
>>>>>> entry in the name node log?
>>>>>> 
>>>>>> On Dec 7, 2014, at 10:24, louis.hust <louis.hust@gmail.com> wrote:
>>>>>> 
>>>>>>> BTW, I use HBase 0.98.6-cdh5.0.2, but Hadoop is 2.3.0-cdh5.0.1 because we upgraded HBase.
>>>>>>> On Dec 6, 2014, at 23:23, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Forwarding to hdfs-dev@ since the logs pertain to the namenode.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> 
>>>>>>>> ---------- Forwarded message ----------
>>>>>>>> From: louis.hust <louis.hust@gmail.com>
>>>>>>>> Date: Fri, Dec 5, 2014 at 8:52 PM
>>>>>>>> Subject: Re: Slow waitForAckedSeqno took too long time
>>>>>>>> To: user@hbase.apache.org
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Ted,
>>>>>>>> 
>>>>>>>> Thanks for your reply.
>>>>>>>> 
>>>>>>>> Actually I shut down l-hbase1.dev.dba.cn0 at 11:47:51, and HBase recovered at 12:03.
>>>>>>>> 
>>>>>>>> The log below is the zkfc log on l-hbase2:
>>>>>>>> 
>>>>>>>> {log}
>>>>>>>> 2014-12-05 11:18:47,077 INFO org.apache.hadoop.ha.ActiveStandbyElector: Already in election. Not re-connecting.
>>>>>>>> 2014-12-05 11:48:16,023 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
>>>>>>>> 2014-12-05 11:48:16,025 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a096d79636c757374657212036e6e311a146c2d6862617365312e6462612e6465762e636e3020d43e28d33e
>>>>>>>> 2014-12-05 11:48:16,028 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020
>>>>>>>> 2014-12-05 11:48:32,031 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
>>>>>>>> 2014-12-05 11:48:47,033 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at l-hbase1.dba.dev.cn0/10.86.36.217:8020 standby (unable to connect)
>>>>>>>> java.net.ConnectException: Call From l-hbase2.dba.dev.cn0.qunar.com/10.86.36.218 to l-hbase1.dba.dev.cn0:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>>>>>> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>>>>>> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>>>>>> 	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>>>>>>> 	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
>>>>>>>> 	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
>>>>>>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
>>>>>>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>>>>>>>> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>>>>>>>> 	at com.sun.proxy.$Proxy8.transitionToStandby(Unknown Source)
>>>>>>>> 	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
>>>>>>>> 	at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
>>>>>>>> 	at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:503)
>>>>>>>> 	at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:494)
>>>>>>>> 	at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:59)
>>>>>>>> 	at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:837)
>>>>>>>> 	at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:901)
>>>>>>>> 	at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:800)
>>>>>>>> 	at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>>>>>>>> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>>>>>>>> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
>>>>>>>> Caused by: java.net.ConnectException: Connection timed out
>>>>>>>> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>>>>> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
>>>>>>>> 	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>>>>>> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
>>>>>>>> 	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
>>>>>>>> 	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:604)
>>>>>>>> 	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:699)
>>>>>>>> 	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
>>>>>>>> 	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1461)
>>>>>>>> 	at org.apache.hadoop.ipc.Client.call(Client.java:1380)
>>>>>>>> 	... 14 more
>>>>>>>> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
>>>>>>>> 2014-12-05 11:48:47,034 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.ShellCommandFencer(/bin/true)
>>>>>>>> 2014-12-05 11:48:47,040 INFO org.apache.hadoop.ha.ShellCommandFencer: Launched fencing command '/bin/true' with pid 24513
>>>>>>>> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.NodeFencer: ====== Fencing successful by method org.apache.hadoop.ha.ShellCommandFencer(/bin/true) ======
>>>>>>>> 2014-12-05 11:48:47,044 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha/mycluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
>>>>>>>> 2014-12-05 11:48:47,063 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 active...
>>>>>>>> 2014-12-05 11:48:47,284 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at l-hbase2.dba.dev.cn0/10.86.36.218:8020 to active state
>>>>>>>> 2014-12-05 12:29:56,906 INFO org.apache.hadoop.hdfs.tools.DFSZKFailoverController: Allowed RPC access from hadoop (auth:SIMPLE) at 10.86.36.217
>>>>>>>> 2014-12-05 12:29:56,907 INFO org.apache.hadoop.ha.ZKFailoverController: Requested by hadoop (auth:SIMPLE) at 10.86.36.217 to cede active role.
>>>>>>>> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully ensured local node is in standby mode
>>>>>>>> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
>>>>>>>> 2014-12-05 12:29:56,943 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb of active node...
>>>>>>>> 
>>>>>>>> {log}
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The name node log is a bit large, so I attach it.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Dec 5, 2014, at 23:01, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> The warning was logged by DFSOutputStream.
>>>>>>>>> 
>>>>>>>>> What was the load on hdfs around 2014-12-05 12:03?
>>>>>>>>> Have you checked the namenode log?
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 4, 2014 at 9:01 PM, mail list <louis.hust.ml@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> I deployed HBase 0.98.6-cdh5.2.0 on 3 machines:
>>>>>>>>>> 
>>>>>>>>>> l-hbase1.dev.dba.cn0 (hadoop namenode active, HMaster active)
>>>>>>>>>> l-hbase2.dev.dba.cn0 (hadoop namenode standby, HMaster standby, hadoop datanode)
>>>>>>>>>> l-hbase3.dev.dba.cn0(regionserver, hadoop datanode)
>>>>>>>>>> 
>>>>>>>>>> Then I shut down l-hbase1.dev.dba.cn0, but HBase could not work until about 15 minutes later.
>>>>>>>>>> I checked the logs and found the following in the region server's log:
>>>>>>>>>> 
>>>>>>>>>> 2014-12-05 12:03:19,169 WARN  [regionserver60020-WAL.AsyncSyncer0] hdfs.DFSClient: Slow waitForAckedSeqno took 927762ms (threshold=30000ms)
>>>>>>>>>> 2014-12-05 12:03:19,186 INFO  [regionserver60020-WAL.AsyncSyncer0] wal.FSHLog: Slow sync cost: 927779 ms, current pipeline: [10.86.36.219:50010]
>>>>>>>>>> 2014-12-05 12:03:19,186 DEBUG [regionserver60020.logRoller] regionserver.LogRoller: HLog roll requested
>>>>>>>>>> 2014-12-05 12:03:19,236 WARN  [regionserver60020-WAL.AsyncSyncer1] hdfs.DFSClient: Slow waitForAckedSeqno took 867706ms (threshold=30000ms)
>>>>>>>>>> 
>>>>>>>>>> It seems the WAL async sync took too long for the region server to recover? I don't know if this log matters.
>>>>>>>>>> Can anybody explain the reason, and how can we reduce the recovery time?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> <l-hbase2.namenode.log>
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
> 
> 
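
For reference on the original question in the quoted thread: the "Slow waitForAckedSeqno" warnings are logged by DFSOutputStream while it waits for the datanode pipeline to acknowledge a flush, which is essentially what an HBase WAL sync does under the hood; the 30000ms threshold in the log appears to be the default of the client's slow-IO warning setting (dfs.client.slow.io.warning.threshold.ms). The sketch below is not from this thread; it just times that same wait directly with hflush(), with an arbitrary, hypothetical probe path and record size.

{code}
// Minimal sketch (not from this thread): timing hflush() on an HDFS stream,
// which is the wait behind "Slow waitForAckedSeqno". The probe path is a
// throwaway file and fs.defaultFS comes from the client configuration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path probe = new Path("/tmp/hflush-probe");   // hypothetical scratch path

        try (FSDataOutputStream out = fs.create(probe, true)) {
            byte[] record = new byte[128];             // dummy WAL-sized record
            for (int i = 0; i < 10; i++) {
                out.write(record);
                long start = System.nanoTime();
                out.hflush();                          // waits for pipeline acks
                long ms = (System.nanoTime() - start) / 1000000L;
                System.out.println("hflush " + i + " took " + ms + " ms");
            }
        } finally {
            fs.delete(probe, false);
        }
    }
}
{code}

Against a healthy pipeline this usually prints a few milliseconds per flush; the 900+ second values quoted above suggest the client was stuck waiting for acks, presumably while its connections to the node blocked by iptables slowly timed out, before the sync could complete.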

