kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Firth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-2572) zk connection instability
Date Wed, 07 Oct 2015 14:39:26 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

John Firth updated KAFKA-2572:
------------------------------
    Description: 
On two occasions, we've seen our process enter a cycle of: zk session expiry; new session
creation; rebalancing activity; pause during which nothing is heard from the zk server. Sometimes,
the reconnections are successful, elements are pulled from Kafka, but then disconnection and
reconnection occurs shortly thereafter, causing OOMs when new elements are pulled in. Restarting
the process that uses the zk client resolved the problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all logs entries
minus entries particular to our application. For 09/08, the time span is 2015-09-08T12:52:06.069-04:00
to 2015-09-08T13:14:48.250-04:00; for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00
to 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only error and
warning entries, and entries containing any of: "begin rebalancing", "end rebalancing", "timed",
and "zookeeper state". For the 09/11 digest logs, entries from the kafka.network.Processor
logger are also excised for clarity. Unfortunately, debug logging was not enabled during these
events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing activity,
followed by a pause during which nothing is heard from the zk server, followed by a session
timeout. A stable session seems to have been established at 2015-09-11T04:13:47.140-04:00,
but messages of the form "I wrote this conflicted ephemeral node [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
at /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper
and retry" were logged out repeatedly until we restarted the process after 2015-09-11T07:44:47.124-04:00,
which marks the final entry in the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a stable session
was not established prior to our restarting the process.
It's perhaps also noteworthy that in the 09/08 case, two timeouts for the same session are
seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00.
The rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the process is killed
after 2015-09-08T13:13:41.655-04:00, which marks the last entry in the logs for this day.

  was:
On two occasions, we've seen our process enter a cycle of: zk session expiry; new session
creation; rebalancing activity; pause during which nothing is heard from the zk server. Restarting
the process using the zk client resolved the problems in both cases.

This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all logs entries
minus entries particular to our application. For 09/08, the time span is 2015-09-08T12:52:06.069-04:00
to 2015-09-08T13:14:48.250-04:00; for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00
to 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only error and
warning entries, and entries containing any of: "begin rebalancing", "end rebalancing", "timed",
and "zookeeper state". For the 09/11 digest logs, entries from the kafka.network.Processor
logger are also excised for clarity. Unfortunately, debug logging was not enabled during these
events.

The 09/11 case shows repeated cycles of session expiry, followed by rebalancing activity,
followed by a pause during which nothing is heard from the zk server, followed by a session
timeout. A stable session seems to have been established at 2015-09-11T04:13:47.140-04:00,
but messages of the form "I wrote this conflicted ephemeral node [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
at /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper
and retry" were logged out repeatedly until we restarted the process after 2015-09-11T07:44:47.124-04:00,
which marks the final entry in the log.

The 09/08 case is a little more straightforward than the 09/11 case, in that a stable session
was not established prior to our restarting the process.
It's perhaps also noteworthy that in the 09/08 case, two timeouts for the same session are
seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00.
The rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
The connection to ZK expires and is restablished multiple times before the process is killed
after 2015-09-08T13:13:41.655-04:00, which marks the last entry in the logs for this day.


> zk connection instability
> -------------------------
>
>                 Key: KAFKA-2572
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2572
>             Project: Kafka
>          Issue Type: Bug
>          Components: zkclient
>    Affects Versions: 0.8.2.1
>         Environment: zk version 3.4.6,
> CentOS 6, 2.6.32-504.1.3.el6.x86_64
>            Reporter: John Firth
>         Attachments: 090815-digest.log, 090815-full.log, 091115-digest.log, 091115-full.log.zip
>
>
> On two occasions, we've seen our process enter a cycle of: zk session expiry; new session
creation; rebalancing activity; pause during which nothing is heard from the zk server. Sometimes,
the reconnections are successful, elements are pulled from Kafka, but then disconnection and
reconnection occurs shortly thereafter, causing OOMs when new elements are pulled in. Restarting
the process that uses the zk client resolved the problems in both cases.
> This behavior was seen on 09/08 and 09/11 -- the attached 'full' logs show all logs entries
minus entries particular to our application. For 09/08, the time span is 2015-09-08T12:52:06.069-04:00
to 2015-09-08T13:14:48.250-04:00; for 11/08, the time span is between 2015-09-11T01:38:17.000-04:00
to 2015-09-11T07:44:47.124-04:00. The digest logs are the result of retaining only error and
warning entries, and entries containing any of: "begin rebalancing", "end rebalancing", "timed",
and "zookeeper state". For the 09/11 digest logs, entries from the kafka.network.Processor
logger are also excised for clarity. Unfortunately, debug logging was not enabled during these
events.
> The 09/11 case shows repeated cycles of session expiry, followed by rebalancing activity,
followed by a pause during which nothing is heard from the zk server, followed by a session
timeout. A stable session seems to have been established at 2015-09-11T04:13:47.140-04:00,
but messages of the form "I wrote this conflicted ephemeral node [{"version":1,"subscription":{"binlogs_mailchimp_us2":100},"pattern":"static","timestamp":"1441959227564"}]
at /consumers/prologue-second-stage_prod_us2/ids/prologue-second-stage_prod_us2_app01.c1.prologue.prod.atl01.rsglab.com-1441812334972-b967b718
a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper
and retry" were logged out repeatedly until we restarted the process after 2015-09-11T07:44:47.124-04:00,
which marks the final entry in the log.
> The 09/08 case is a little more straightforward than the 09/11 case, in that a stable
session was not established prior to our restarting the process.
> It's perhaps also noteworthy that in the 09/08 case, two timeouts for the same session
are seen during a single rebalance, at 2015-09-08T12:52:19.107-04:00 and 2015-09-08T12:52:31.639-04:00.
The rebalance in question begins at 2015-09-08T12:52:06.667-04:00.
> The connection to ZK expires and is restablished multiple times before the process is
killed after 2015-09-08T13:13:41.655-04:00, which marks the last entry in the logs for this
day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message