zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "R. Tyler Croy" <ty...@monkeypox.org>
Subject Re: Large number of errored transactions and periodic node disconnects on 3.4.6
Date Wed, 14 Jan 2015 19:07:18 GMT
(replies inline)

On Wed, 14 Jan 2015, Flavio Junqueira wrote:

> I'm not sure what could be causing the communication between servers to fail.
> Are you sure this is a ZK problem? Could it be a problem with your setting?


We're not certain of much at this point to be honest. We've executed some
network stress tests and see no disruptions in the layer 1-3 network
communications.

I'm not aware of any Zookeeper server-side setting that would result in the
behavior we're seeing but I'm happy to try out some configuration changes if
someboyd believes those will either help alleviate the problem or help identify
it with more exactness.




>      On Wednesday, January 14, 2015 4:33 PM, R. Tyler Croy <tyler@monkeypox.org>
wrote:
>    
>  
> A colleague of mine and I are still fully trying to understand the behavior
> we're seeing from one of our Zookeeper clusters running 3.4.6.
> 
> Some log snippets are included further down, but the symptom of what we're
> seeing is that nodes in the cluster periodically detach themselves entirely
> from the rest of the cluster. The period is not however consistent, in at least
> one known case we've seen multiple nodes detach around the same time causing
> leader re-election challenges (putting it kindly!).
> 
> Our investigation has led us to see a large number of transaction errors
> (txnType: -1) in our trace logs with "n/a" paths. To give you an idea of how
> large of a number we're talking about, here's a summary from ~20min of logs:
> 
> -> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' | sort | uniq -c
| sort -nr
> 70585 txntype:-1 reqpath:n/a
> 4860 txntype:5 reqpath:n/a
> 7 txntype:-11 reqpath:n/a
> 7 txntype:-10 reqpath:n/a
> [..snip..]
> 
> The number of session create/close transaction types looks about right as does
> the number of 'set data' transactions; however the "reqpath" being "n/a" does
> not look entirely correct.
> 
> 
> We're only using Zookeeper to hold offsets for normal Kafka high-level
> consumers and Storm's KafkaSpout consumers, and I'm relatively confident that
> those consumers are behaving correctly.
> 
> 
> We have also come across a number of ostensibly resolved bugs in JIRA that, to
> our untrained eyes, look similar to what we're seeing:
> 
>     "Zookeeper 3.3.4 loses ephemeral nodes under stress"
>         <https://issues.apache.org/jira/browse/ZOOKEEPER-1740>
>     "the threads number of a zookeeper is increased all the time"
>         <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing the
>         thread growth mentioned here, just log similarities)
> 
> 
> The node detaching issue is not something we've been able to recreate on our
> own, so I don't entirely have a reproduction case. This is ZK 3.4.6 on OpenJDK
> 1.7.0u67 running on Ubuntu 12.04, other than that I'm happy to provide
> information that might help determine the cause of our problems!
> 
> 
> Cheers (logs snippets to follow)
> 
> 
> Error logs from an incident where a node detached itself:
> 
>     2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>     java.net.SocketTimeoutException: Read timed out
>             at java.net.SocketInputStream.socketRead0(Native Method)
>             at java.net.SocketInputStream.read(SocketInputStream.java:152)
>             at java.net.SocketInputStream.read(SocketInputStream.java:122)
>             at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>             at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>             at java.io.DataInputStream.readInt(DataInputStream.java:387)
>             at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>             at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>             at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>             at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>     2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>     java.net.SocketTimeoutException: Read timed out
>             at java.net.SocketInputStream.socketRead0(Native Method)
>             at java.net.SocketInputStream.read(SocketInputStream.java:152)
>             at java.net.SocketInputStream.read(SocketInputStream.java:122)
>             at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>             at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>             at java.io.DataInputStream.readInt(DataInputStream.java:387)
>             at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>             at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>             at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>             at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>     2015-01-02 20:10:37,577 [myid:4] - WARN  [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646]
- ******* GOODBYE /10.96.100.69:54260 ********
>     2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>     java.net.SocketTimeoutException: Read timed out
>             at java.net.SocketInputStream.socketRead0(Native Method)
>             at java.net.SocketInputStream.read(SocketInputStream.java:152)
>             at java.net.SocketInputStream.read(SocketInputStream.java:122)
>             at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>             at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>             at java.io.DataInputStream.readInt(DataInputStream.java:387)
>             at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>             at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>             at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>             at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>     2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646]
- ******* GOODBYE /10.96.14.103:47982 ********
>     2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
>     java.net.SocketTimeoutException: Read timed out
>             at java.net.SocketInputStream.socketRead0(Native Method)
>             at java.net.SocketInputStream.read(SocketInputStream.java:152)
>             at java.net.SocketInputStream.read(SocketInputStream.java:122)
>             at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>             at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>             at java.io.DataInputStream.readInt(DataInputStream.java:387)
>             at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>             at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>             at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>             at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
>     2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646]
- ******* GOODBYE /10.84.11.131:47798 ********
>     2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646]
- ******* GOODBYE /10.96.100.136:48175 ********
> 
> 
> 
> Some trace logs from "normal operation" that looked suspect to us:
> 
> 
>     2015-01-12 14:05:12,753 [myid:1] - TRACE [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71]
- loadData â?????? session in archive: 381306787892823513 with timeout: 6000
>     2015-01-12 14:05:12,797 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240]
- Snapshotting: 0xb3ea0ee8f4db to /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db
>     2015-01-12 14:05:12,822 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374]
- Got zxid 0xb3ea0ee8f4dc expected 0x1
>     2015-01-12 14:05:12,833 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
txntype:-1 reqpath:n/a
>     2015-01-12 14:05:12,833 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88]
- Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
txntype:-1 reqpath:n/a
>     2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047
type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
>     2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring
processTxn failure hdr: -1 : error: -110
>     2015-01-12 14:05:12,838 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
txntype:-1 reqpath:n/a
>     2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88]
- Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
txntype:-1 reqpath:n/a
>     2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047
type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
>     2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring
processTxn failure hdr: -1 : error: -110
>     2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840 zxid:0xb3ea0ee8f4de
txntype:-1 reqpath:n/a
>     2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843 zxid:0xb3ea0ee8f4df
txntype:-1 reqpath:n/a
> 
>  
>    

Mime
View raw message