zookeeper-dev mailing list archives

From "Yicheng Fang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ZOOKEEPER-2899) Zookeeper not receiving packets after ZXID overflows
Date Fri, 15 Sep 2017 18:00:01 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16168208#comment-16168208
] 

Yicheng Fang edited comment on ZOOKEEPER-2899 at 9/15/17 5:59 PM:
------------------------------------------------------------------

ZXID overflowed in prod:

We observed that the ensemble was not receiving any packets during the outage, as can be
seen in the attachment 'image12.png', a Grafana graph whose data comes from the four-letter
word commands. Meanwhile, the node count dropped by ~10000 and stayed flat at 302,500 after
the overflow. The aggregated log is attached as 'zk_20170309_wo_noise.log'; it seems to show
that leader election finished successfully, a quorum formed, and the ZK servers started up.
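
For readers less familiar with the term: a ZXID is a 64-bit value whose high 32 bits hold the leader epoch and whose low 32 bits hold a per-epoch transaction counter, so "overflow" here means the low 32 bits running out. The following is only a minimal sketch of how the remaining headroom could be computed from a "Zxid: 0x..." line as printed by the 'srvr' four-letter word command. The class and method names (ZxidSketch, remainingInEpoch, parseZxidLine) and the sample value are hypothetical and are not taken from the ZooKeeper code base or from this incident.

```java
// Hypothetical helper, not part of ZooKeeper: splits a zxid into epoch and counter.
final class ZxidSketch {
    // Low 32 bits of a zxid are the per-epoch transaction counter.
    static final long COUNTER_MASK = 0xffffffffL;

    // High 32 bits: the leader epoch.
    static long epochOf(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits: transactions issued so far in this epoch.
    static long counterOf(long zxid) {
        return zxid & COUNTER_MASK;
    }

    // How many more transactions fit before the counter is exhausted.
    static long remainingInEpoch(long zxid) {
        return COUNTER_MASK - counterOf(zxid);
    }

    // Parses the hex value out of a line such as "Zxid: 0x5fffffff0".
    static long parseZxidLine(String line) {
        String hex = line.substring(line.indexOf("0x") + 2).trim();
        return Long.parseUnsignedLong(hex, 16);
    }

    public static void main(String[] args) {
        // Sample value made up for illustration only.
        long zxid = parseZxidLine("Zxid: 0x5fffffff0");
        System.out.println("epoch=" + epochOf(zxid)
                + " counter=" + counterOf(zxid)
                + " remaining=" + remainingInEpoch(zxid));
    }
}
```

Polling this value and alerting when the remaining headroom gets small would at least make the next overflow predictable, though it does not explain why the ensemble stopped receiving packets afterwards.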

However, we did see a lot of the following errors after the ZK servers went up:
{noformat}
2017-03-09 09:00:12,420 - ERROR [CommitProcessor:2:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)


2017-03-09 09:00:13,210 - ERROR [CommitProcessor:1:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1113)
        at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:120)
        at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:92)
        at org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:594)
        at org.apache.zookeeper.server.DataTree.killSession(DataTree.java:966)
        at org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:818)
        at org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:329)
        at org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:965)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:116)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
{noformat}
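
As background on the exception itself: both traces fail in SelectionKeyImpl.interestOps, reached from NIOServerCnxn.sendBuffer. In plain java.nio, calling interestOps on a SelectionKey that has already been cancelled throws CancelledKeyException. The sketch below reproduces only that JDK behaviour, using a Pipe so that no network is needed. The class name CancelledKeySketch is hypothetical, and whether the keys in the incident were cancelled because clients had already disconnected is an assumption that the log does not confirm.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Hypothetical demo class: shows that interestOps() on a cancelled key throws.
final class CancelledKeySketch {
    static boolean interestOpsThrowsAfterCancel() {
        try {
            Pipe pipe = Pipe.open();
            try (Selector selector = Selector.open()) {
                // Channels must be non-blocking before they can be registered.
                pipe.source().configureBlocking(false);
                SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);

                // Cancelling invalidates the key immediately.
                key.cancel();
                try {
                    // Same call that appears at the top of the stack traces.
                    key.interestOps(SelectionKey.OP_READ);
                    return false;
                } catch (CancelledKeyException expected) {
                    return true;
                }
            } finally {
                pipe.source().close();
                pipe.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("CancelledKeyException thrown: " + interestOpsThrowsAfterCancel());
    }
}
```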

We mitigated the issue by restarting the ensemble, after which we saw traffic flowing into
the ensemble again and the whole system started recovering.



> Zookeeper not receiving packets after ZXID overflows
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2899
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2899
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: 5 host ensemble, 1500+ client connections each, 300K+ nodes
> OS: Ubuntu precise
> JAVA 7
> JuniperQFX510048T NIC, 10000Mb/s, ixgbe driver
> 6 core Intel(R)_Xeon(R)_CPU_E5-2620_v3_@_2.40GHz
> 4 HDD 600G each 
>            Reporter: Yicheng Fang
>         Attachments: GC_metric.png, image12.png, image13.png, message_in_per_sec.png,
metric_volume.png, zk_20170309_wo_noise.log
>
>
> ZK was used with Kafka (version 0.10.0) for coordination. We had a lot of Kafka consumers
writing consumption offsets to ZK.
> We observed the issue twice within the last year. Each time, after the ZXID overflowed,
ZK stopped receiving packets even though leader election looked successful in the logs
and the ZK servers were up. As a result, the whole Kafka system came to a halt.
> In an attempt to reproduce (and hopefully fix) the issue, I set up test ZK and Kafka
clusters and fed them production-like test traffic. Though I was not able to reproduce
the issue, I did see that the Kafka consumers, which used ZK clients, essentially DoSed the
ensemble, filling up `submittedRequests` in `PrepRequestProcessor` and causing read
latencies of 100ms and more.
> More details are included in the comments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
