zookeeper-user mailing list archives

From maca.ee...@g2.nctu.edu.tw
Subject Re: Large number of errored transactions and periodic node disconnects on 3.4.6
Date Wed, 14 Jan 2015 23:16:49 GMT
Thanks for your kind reply, but I can't fully understand your discussion.
Basically I have two devices (an iPhone 5 and an iPad Air), but sometimes I also play on other
iOS devices. In any case, the iPhone 5 would be my first-priority device. Can you restore my character
status to how it was before your server maintenance?
(I tried deleting the app and reinstalling it, but I still can't connect to the server.)
Thanks for your help~

Sent from my iPhone


> On 15 Jan 2015, at 6:01 AM, R. Tyler Croy <tyler@monkeypox.org> wrote:
> 
> Replying to the list to make sure that Ben's responses get archived below
> 
>> On Wed, 14 Jan 2015, Ben Smith wrote:
>> 
>> On Wed, Jan 14, 2015 at 1:28 PM, Flavio Junqueira <fpjunqueira@yahoo.com>
>> wrote:
>> 
>>> Do you have anything running along with the ZK servers on the same
>>> machine?
>> 
>> 
>> No, the ZooKeeper machines are 'dedicated'. The only other things running
>> are infrastructure-related (metrics and log-collection agents that incur
>> no/low load).
>> 
>> 
>>> Is that a virtualized environment?
>> 
>> 
>> Yes, this is in the 'cloud' (: They are on AWS c3.large instances. CPU and
>> memory utilization on the instances is really low.
>> 
>> 
>>> I'm wondering if a concurrent workload is causing servers to lag behind
>>> and drop the ensemble. Also, how many disk devices are you using? is there
>>> a dedicated device for the txn log?
>> 
>> They are only using one device each; it is not dedicated, but it also isn't
>> anywhere close to saturation, using iostat to check svctm:
>> 
>> https://gist.github.com/polynomial/2dc6aefefd7fa31afc73
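>>
>> (For reference, a typical way to spot-check that with sysstat's iostat is an
>> extended-stats sample every few seconds, e.g.:
>>
>>   $ iostat -x 5
>>
>> watching the await/svctm/%util columns for the device that holds the txn
>> log. This is a generic sketch, not the exact invocation behind the gist
>> above.)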
>> 
>> Here is the config if that helps too:
>> 
>> $ echo 'conf' | nc localhost 2181
>> clientPort=2181
>> dataDir=/path/to/zookeeper/snapshots/version-2
>> dataLogDir=/path/to/zookeeper/transactions/version-2
>> tickTime=2000
>> maxClientCnxns=60
>> minSessionTimeout=4000
>> maxSessionTimeout=40000
>> serverId=4
>> initLimit=10
>> syncLimit=5
>> electionAlg=3
>> electionPort=3888
>> quorumPort=2888
>> peerType=0
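>>
>> (For reference, with tickTime=2000 these settings work out to: followers get
>> initLimit*tickTime = 10*2000 ms = 20 s to connect and sync with the leader,
>> a follower that does not respond within syncLimit*tickTime = 5*2000 ms = 10 s
>> is dropped from the ensemble, and client sessions can negotiate timeouts
>> between 4 s and 40 s. That 10 s sync window appears to line up with the
>> LearnerHandler read timeouts quoted later in this message.)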
>> 
>> -ben
>> 
>> 
>>> 
>>> -Flavio
>>> 
>>>> On 14 Jan 2015, at 19:07, R. Tyler Croy <tyler@monkeypox.org> wrote:
>>>> 
>>>> (replies inline)
>>>> 
>>>>> On Wed, 14 Jan 2015, Flavio Junqueira wrote:
>>>>> 
>>>>> I'm not sure what could be causing the communication between servers to
>>>>> fail. Are you sure this is a ZK problem? Could it be a problem with your
>>>>> settings?
>>>> 
>>>> 
>>>> We're not certain of much at this point to be honest. We've executed some
>>>> network stress tests and see no disruptions in the layer 1-3 network
>>>> communications.
>>>> 
>>>> I'm not aware of any Zookeeper server-side setting that would result in
>>>> the behavior we're seeing, but I'm happy to try out some configuration
>>>> changes if somebody believes those will either help alleviate the problem
>>>> or help identify it with more exactness.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy <tyler@monkeypox.org> wrote:
>>>>> 
>>>>> 
>>>>> A colleague of mine and I are still trying to fully understand the
>>>>> behavior we're seeing from one of our Zookeeper clusters running 3.4.6.
>>>>> 
>>>>> Some log snippets are included further down, but the symptom of what
>>>>> we're seeing is that nodes in the cluster periodically detach themselves
>>>>> entirely from the rest of the cluster. The period is not consistent,
>>>>> however; in at least one known case we've seen multiple nodes detach
>>>>> around the same time, causing leader re-election challenges (putting it
>>>>> kindly!).
>>>>> 
>>>>> Our investigation has led us to see a large number of transaction errors
>>>>> (txntype: -1) in our trace logs with "n/a" paths. To give you an idea of
>>>>> how large a number we're talking about, here's a summary from ~20min of
>>>>> logs:
>>>>> 
>>>>> -> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' | sort | uniq -c | sort -nr
>>>>> 70585 txntype:-1 reqpath:n/a
>>>>> 4860 txntype:5 reqpath:n/a
>>>>> 7 txntype:-11 reqpath:n/a
>>>>> 7 txntype:-10 reqpath:n/a
>>>>> [..snip..]
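>>>>>
>>>>> (For reference, these txntype values appear to map to ZooKeeper's
>>>>> org.apache.zookeeper.ZooDefs.OpCode constants: -1 is error, 5 is setData,
>>>>> -10 is createSession and -11 is closeSession. reqpath is printed as "n/a"
>>>>> whenever the request carries no client path, which is normal for
>>>>> replicated transactions, so the "n/a" on its own is probably not the
>>>>> interesting part.)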
>>>>> 
>>>>> The number of session create/close transaction types looks about right,
>>>>> as does the number of 'set data' transactions; however, the "reqpath"
>>>>> being "n/a" does not look entirely correct.
>>>>> 
>>>>> 
>>>>> We're only using Zookeeper to hold offsets for normal Kafka high-level
>>>>> consumers and Storm's KafkaSpout consumers, and I'm relatively confident
>>>>> that those consumers are behaving correctly.
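>>>>>
>>>>> (As a sanity check, those offsets can be eyeballed with zkCli.sh; the
>>>>> group/topic names here are made-up placeholders, assuming the Kafka 0.8
>>>>> high-level consumer layout:
>>>>>
>>>>>   $ zkCli.sh -server localhost:2181
>>>>>   [zk: localhost:2181(CONNECTED) 0] ls /consumers
>>>>>   [zk: localhost:2181(CONNECTED) 1] get /consumers/my-group/offsets/my-topic/0
>>>>>
>>>>> where the last command prints the committed offset znode for partition 0
>>>>> of the topic.)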
>>>>> 
>>>>> 
>>>>> We have also come across a number of ostensibly resolved bugs in JIRA
>>>>> that, to our untrained eyes, look similar to what we're seeing:
>>>>> 
>>>>> "Zookeeper 3.3.4 loses ephemeral nodes under stress"
>>>>>    <https://issues.apache.org/jira/browse/ZOOKEEPER-1740>
>>>>> "the threads number of a zookeeper is increased all the time"
>>>>>    <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing the
>>>>>    thread growth mentioned here, just log similarities)
>>>>> 
>>>>> 
>>>>> The node-detaching issue is not something we've been able to recreate on
>>>>> our own, so I don't entirely have a reproduction case. This is ZK 3.4.6
>>>>> on OpenJDK 1.7.0u67 running on Ubuntu 12.04; other than that, I'm happy
>>>>> to provide any information that might help determine the cause of our
>>>>> problems!
>>>>> 
>>>>> 
>>>>> Cheers (logs snippets to follow)
>>>>> 
>>>>> 
>>>>> Error logs from an incident where a node detached itself:
>>>>> 
>>>>> 2015-01-02 20:10:37,576 [myid:4] - ERROR
>>> [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633] - Unexpected
>>> exception causing shutdown while sock still open
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>        at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>>>>        at
>>> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>>>>        at
>>> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>>>>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>>>>> 2015-01-02 20:10:37,577 [myid:4] - ERROR
>>> [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633] - Unexpected
>>> exception causing shutdown while sock still open
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>        at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>>>>        at
>>> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>>>>        at
>>> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>>>>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>>>>> 2015-01-02 20:10:37,577 [myid:4] - WARN
>>> [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646] - ******* GOODBYE
>>> /10.96.100.69:54260 ********
>>>>> 2015-01-02 20:10:37,576 [myid:4] - ERROR
>>> [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633] - Unexpected
>>> exception causing shutdown while sock still open
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>        at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>>>>        at
>>> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>>>>        at
>>> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>>>>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>>>>> 2015-01-02 20:10:37,578 [myid:4] - WARN
>>> [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646] - ******* GOODBYE
>>> /10.96.14.103:47982 ********
>>>>> 2015-01-02 20:10:37,577 [myid:4] - ERROR
>>> [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633] - Unexpected
>>> exception causing shutdown while sock still open
>>>>> java.net.SocketTimeoutException: Read timed out
>>>>>        at java.net.SocketInputStream.socketRead0(Native Method)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>>>>        at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>>>>        at
>>> java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>>>>        at
>>> java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>>>>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>>>>        at
>>> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>>>>        at
>>> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>>>>> 2015-01-02 20:10:37,578 [myid:4] - WARN
>>> [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646] - ******* GOODBYE
>>> /10.84.11.131:47798 ********
>>>>> 2015-01-02 20:10:37,578 [myid:4] - WARN
>>> [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646] - *******
>>> GOODBYE /10.96.100.136:48175 ********
>>>>> 
>>>>> 
>>>>> 
>>>>> Some trace logs from "normal operation" that looked suspect to us:
>>>>> 
>>>>> 
>>>>> 2015-01-12 14:05:12,753 [myid:1] - TRACE
>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71] - loadData --
>>> session in archive: 381306787892823513 with timeout: 6000
>>>>> 2015-01-12 14:05:12,797 [myid:1] - INFO
>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] -
>>> Snapshotting: 0xb3ea0ee8f4db to
>>> /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db
>>>>> 2015-01-12 14:05:12,822 [myid:1] - WARN
>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374] - Got zxid
>>> 0xb3ea0ee8f4dc expected 0x1
>>>>> 2015-01-12 14:05:12,833 [myid:1] - DEBUG
>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
>>> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a
>>> zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
>>>>> 2015-01-12 14:05:12,833 [myid:1] - DEBUG
>>> [CommitProcessor:1:FinalRequestProcessor@88] - Processing request::
>>> sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
>>> txntype:-1 reqpath:n/a
>>>>> 2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90]
>>> - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483a
>>> zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
>>>>> 2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949]
>>> - Ignoring processTxn failure hdr: -1 : error: -110
>>>>> 2015-01-12 14:05:12,838 [myid:1] - DEBUG
>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
>>> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d
>>> zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
>>>>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG
>>> [CommitProcessor:1:FinalRequestProcessor@88] - Processing request::
>>> sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
>>> txntype:-1 reqpath:n/a
>>>>> 2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90]
>>> - :Esessionid:0x24aac647ab40047 type:error cxid:0x29df483d
>>> zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
>>>>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949]
>>> - Ignoring processTxn failure hdr: -1 : error: -110
>>>>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG
>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
>>> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840
>>> zxid:0xb3ea0ee8f4de txntype:-1 reqpath:n/a
>>>>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG
>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161] -
>>> Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843
>>> zxid:0xb3ea0ee8f4df txntype:-1 reqpath:n/a
>>> 
>>> 
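>>>
>>> (For what it's worth, in org.apache.zookeeper.KeeperException.Code an error
>>> value of -110 is NODEEXISTS, and a txntype/hdr of -1 is the error opcode,
>>> so the "Ignoring processTxn failure hdr: -1 : error: -110" lines appear to
>>> be operations that already failed on the leader (e.g. creating a znode that
>>> exists) being committed as error records, rather than a sign of data
>>> corruption.)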
