zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Junqueira <fpjunque...@yahoo.com.INVALID>
Subject Re: Large number of errored transactions and periodic node disconnects on 3.4.6
Date Wed, 14 Jan 2015 21:28:32 GMT
Do you have anything running along with the ZK servers on the same machine? Is that a virtualized
environment? I'm wondering if a concurrent workload is causing servers to lag behind and drop
the ensemble. Also, how many disk devices are you using? is there a dedicated device for the
txn log?

-Flavio

> On 14 Jan 2015, at 19:07, R. Tyler Croy <tyler@monkeypox.org> wrote:
> 
> (replies inline)
> 
> On Wed, 14 Jan 2015, Flavio Junqueira wrote:
> 
>> I'm not sure what could be causing the communication between servers to fail.
>> Are you sure this is a ZK problem? Could it be a problem with your setting?
> 
> 
> We're not certain of much at this point to be honest. We've executed some
> network stress tests and see no disruptions in the layer 1-3 network
> communications.
> 
> I'm not aware of any Zookeeper server-side setting that would result in the
> behavior we're seeing but I'm happy to try out some configuration changes if
> someboyd believes those will either help alleviate the problem or help identify
> it with more exactness.
> 
> 
> 
> 
>> On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy <tyler@monkeypox.org>
wrote:
>> 
>> 
>> A colleague of mine and I are still fully trying to understand the behavior
>> we're seeing from one of our Zookeeper clusters running 3.4.6.
>> 
>> Some log snippets are included further down, but the symptom of what we're
>> seeing is that nodes in the cluster periodically detach themselves entirely
>> from the rest of the cluster. The period is not however consistent, in at least
>> one known case we've seen multiple nodes detach around the same time causing
>> leader re-election challenges (putting it kindly!).
>> 
>> Our investigation has led us to see a large number of transaction errors
>> (txnType: -1) in our trace logs with "n/a" paths. To give you an idea of how
>> large of a number we're talking about, here's a summary from ~20min of logs:
>> 
>> -> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' | sort | uniq
-c | sort -nr
>> 70585 txntype:-1 reqpath:n/a
>> 4860 txntype:5 reqpath:n/a
>> 7 txntype:-11 reqpath:n/a
>> 7 txntype:-10 reqpath:n/a
>> [..snip..]
>> 
>> The number of session create/close transaction types looks about right as does
>> the number of 'set data' transactions; however the "reqpath" being "n/a" does
>> not look entirely correct.
>> 
>> 
>> We're only using Zookeeper to hold offsets for normal Kafka high-level
>> consumers and Storm's KafkaSpout consumers, and I'm relatively confident that
>> those consumers are behaving correctly.
>> 
>> 
>> We have also come across a number of ostensibly resolved bugs in JIRA that, to
>> our untrained eyes, look similar to what we're seeing:
>> 
>> "Zookeeper 3.3.4 loses ephemeral nodes under stress"
>>     <https://issues.apache.org/jira/browse/ZOOKEEPER-1740>
>> "the threads number of a zookeeper is increased all the time"
>>     <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing the
>>     thread growth mentioned here, just log similarities)
>> 
>> 
>> The node detaching issue is not something we've been able to recreate on our
>> own, so I don't entirely have a reproduction case. This is ZK 3.4.6 on OpenJDK
>> 1.7.0u67 running on Ubuntu 12.04, other than that I'm happy to provide
>> information that might help determine the cause of our problems!
>> 
>> 
>> Cheers (logs snippets to follow)
>> 
>> 
>> Error logs from an incident where a node detached itself:
>> 
>> 2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>> java.net.SocketTimeoutException: Read timed out
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>> 2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>> java.net.SocketTimeoutException: Read timed out
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>> 2015-01-02 20:10:37,577 [myid:4] - WARN  [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646]
- ******* GOODBYE /10.96.100.69:54260 ********
>> 2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>> java.net.SocketTimeoutException: Read timed out
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>> 2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646]
- ******* GOODBYE /10.96.14.103:47982 ********
>> 2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>> java.net.SocketTimeoutException: Read timed out
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>> 2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646]
- ******* GOODBYE /10.84.11.131:47798 ********
>> 2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646]
- ******* GOODBYE /10.96.100.136:48175 ********
>> 
>> 
>> 
>> Some trace logs from "normal operation" that looked suspect to us:
>> 
>> 
>> 2015-01-12 14:05:12,753 [myid:1] - TRACE [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71]
- loadData รข?????? session in archive: 381306787892823513 with timeout: 6000
>> 2015-01-12 14:05:12,797 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240]
- Snapshotting: 0xb3ea0ee8f4db to /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db
>> 2015-01-12 14:05:12,822 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374]
- Got zxid 0xb3ea0ee8f4dc expected 0x1
>> 2015-01-12 14:05:12,833 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
txntype:-1 reqpath:n/a
>> 2015-01-12 14:05:12,833 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88]
- Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
txntype:-1 reqpath:n/a
>> 2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047
type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
>> 2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring
processTxn failure hdr: -1 : error: -110
>> 2015-01-12 14:05:12,838 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
txntype:-1 reqpath:n/a
>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88]
- Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
txntype:-1 reqpath:n/a
>> 2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047
type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring
processTxn failure hdr: -1 : error: -110
>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840 zxid:0xb3ea0ee8f4de
txntype:-1 reqpath:n/a
>> 2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843 zxid:0xb3ea0ee8f4df
txntype:-1 reqpath:n/a
>> 
>> 
>> 


Mime
View raw message