zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "R. Tyler Croy" <ty...@monkeypox.org>
Subject Large number of errored transactions and periodic node disconnects on 3.4.6
Date Wed, 14 Jan 2015 16:32:18 GMT

A colleague of mine and I are still fully trying to understand the behavior
we're seeing from one of our Zookeeper clusters running 3.4.6.

Some log snippets are included further down, but the symptom of what we're
seeing is that nodes in the cluster periodically detach themselves entirely
from the rest of the cluster. The period is not however consistent, in at least
one known case we've seen multiple nodes detach around the same time causing
leader re-election challenges (putting it kindly!).

Our investigation has led us to see a large number of transaction errors
(txnType: -1) in our trace logs with "n/a" paths. To give you an idea of how
large of a number we're talking about, here's a summary from ~20min of logs:

-> # grep txntype zookeeper.log* | perl -pe 's/.*txntype/txntype/' | sort | uniq -c | sort
-nr
70585 txntype:-1 reqpath:n/a
4860 txntype:5 reqpath:n/a
7 txntype:-11 reqpath:n/a
7 txntype:-10 reqpath:n/a
[..snip..]

The number of session create/close transaction types looks about right as does
the number of 'set data' transactions; however the "reqpath" being "n/a" does
not look entirely correct.


We're only using Zookeeper to hold offsets for normal Kafka high-level
consumers and Storm's KafkaSpout consumers, and I'm relatively confident that
those consumers are behaving correctly.


We have also come across a number of ostensibly resolved bugs in JIRA that, to
our untrained eyes, look similar to what we're seeing:

    "Zookeeper 3.3.4 loses ephemeral nodes under stress"
        <https://issues.apache.org/jira/browse/ZOOKEEPER-1740>
    "the threads number of a zookeeper is increased all the time"
        <https://issues.apache.org/jira/browse/ZOOKEEPER-939> (not seeing the
        thread growth mentioned here, just log similarities)


The node detaching issue is not something we've been able to recreate on our
own, so I don't entirely have a reproduction case. This is ZK 3.4.6 on OpenJDK
1.7.0u67 running on Ubuntu 12.04, other than that I'm happy to provide
information that might help determine the cause of our problems!


Cheers (logs snippets to follow)


Error logs from an incident where a node detached itself:

    2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.69:54260:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.96.100.90:43220:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,577 [myid:4] - WARN  [LearnerHandler-/10.96.100.69:54260:LearnerHandler@646]
- ******* GOODBYE /10.96.100.69:54260 ********
    2015-01-02 20:10:37,576 [myid:4] - ERROR [LearnerHandler-/10.96.100.136:48175:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.14.103:47982:LearnerHandler@646]
- ******* GOODBYE /10.96.14.103:47982 ********
    2015-01-02 20:10:37,577 [myid:4] - ERROR [LearnerHandler-/10.84.11.131:47798:LearnerHandler@633]
- Unexpected exception causing shutdown while sock still open
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.read(SocketInputStream.java:152)
            at java.net.SocketInputStream.read(SocketInputStream.java:122)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
            at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:546)
    2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.84.11.131:47798:LearnerHandler@646]
- ******* GOODBYE /10.84.11.131:47798 ********
    2015-01-02 20:10:37,578 [myid:4] - WARN  [LearnerHandler-/10.96.100.136:48175:LearnerHandler@646]
- ******* GOODBYE /10.96.100.136:48175 ********



Some trace logs from "normal operation" that looked suspect to us:


    2015-01-12 14:05:12,753 [myid:1] - TRACE [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooTrace@71]
- loadData — session in archive: 381306787892823513 with timeout: 6000
    2015-01-12 14:05:12,797 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240]
- Snapshotting: 0xb3ea0ee8f4db to /srv/zookeeper/snapshots/version-2/snapshot.b3ea0ee8f4db
    2015-01-12 14:05:12,822 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@374]
- Got zxid 0xb3ea0ee8f4dc expected 0x1
    2015-01-12 14:05:12,833 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,833 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88]
- Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc
txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,838 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047
type:error cxid:0x29df483a zxid:0xb3ea0ee8f4dc txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,838 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring processTxn
failure hdr: -1 : error: -110
    2015-01-12 14:05:12,838 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:FinalRequestProcessor@88]
- Processing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd
txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@90] - :Esessionid:0x24aac647ab40047
type:error cxid:0x29df483d zxid:0xb3ea0ee8f4dd txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [CommitProcessor:1:DataTree@949] - Ignoring processTxn
failure hdr: -1 : error: -110
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4840 zxid:0xb3ea0ee8f4de
txntype:-1 reqpath:n/a
    2015-01-12 14:05:12,839 [myid:1] - DEBUG [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@161]
- Committing request:: sessionid:0x24aac647ab40047 type:error cxid:0x29df4843 zxid:0xb3ea0ee8f4df
txntype:-1 reqpath:n/a

Mime
View raw message