zookeeper-user mailing list archives

From James Lent <jl...@digitalsmiths.com>
Subject Question Regarding Ephemeral Nodes And New Sessions (Related To KAFKA-1387)
Date Thu, 02 Oct 2014 16:33:29 GMT
I have run into the issue documented by KAFKA-1387 and have been trying to come up with a solution.
 A summary of this issue is:

  *   Kafka registers brokers and consumers via ephemeral ZooKeeper nodes.
  *   When a connection fails and an Expire event is received, Kafka reconnects and then attempts
to recreate these ephemeral nodes.
  *   If the node still/already exists when Kafka attempts to recreate it, Kafka currently (in the
version I am working with) assumes that ZooKeeper is just slow deleting the node from the
old session and therefore goes into a delay loop, waiting for ZooKeeper to remove the stale
node so it can create a new ephemeral node associated with the new session.
  *   In my stress testing I have seen cases where the connection fails multiple times
in a short period of time. If one of these failures occurs while an Expire event is being handled,
Kafka can end up with a backlog of two or more Expire events.  When the first of these finally
gets processed it recreates the node against the latest session.  However, when the next one
is processed the Kafka broker or consumer goes into a never-ending delay loop waiting for
the stale node to go away.  That will not happen unless the connection fails again, but
then the process just repeats itself.
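
To make the failure mode above concrete, here is a minimal Python sketch using an in-memory stand-in for ZooKeeper. It is not Kafka or ZkClient code: `FakeZk`, `handle_expire_with_delay_loop`, the path, and the `max_attempts` cap are all hypothetical names chosen for illustration, and the cap only stands in for what would otherwise be an endless loop with a sleep between attempts.

```python
class FakeZk:
    """In-memory stand-in for ZooKeeper: maps path -> owning session id."""

    def __init__(self):
        self.nodes = {}
        self.session_id = 1

    def expire_session(self):
        # Ephemeral nodes of the expired session are removed and the
        # client ends up with a new session.
        self.nodes = {p: s for p, s in self.nodes.items() if s != self.session_id}
        self.session_id += 1

    def create_ephemeral(self, path):
        if path in self.nodes:
            return False  # node already exists
        self.nodes[path] = self.session_id
        return True


def handle_expire_with_delay_loop(zk, path, max_attempts=5):
    """Retry creation until it succeeds (the behavior described above).

    Returns the attempt number that succeeded, or None if every attempt
    found the node still present. The cap is an assumption of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        if zk.create_ephemeral(path):
            return attempt
    return None


zk = FakeZk()
path = "/brokers/ids/0"  # illustrative path
zk.create_ephemeral(path)

# The session expires twice before either Expire event is handled,
# leaving a backlog of two events.
zk.expire_session()
zk.expire_session()

results = [handle_expire_with_delay_loop(zk, path) for _ in range(2)]
print(results)
```

The first handler recreates the node under the latest session. The second handler then finds a node that belongs to the current, live session, so no amount of waiting will make it disappear.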

I proposed a fix in the KAFKA-1387 Jira issue to generate some discussion of potential fixes
for this issue. One of the Kafka developers requested that I vet the basic assumption of my
fix with the ZooKeeper team.  My solution is basically:

  *   Register (via ZkClient) for notifications of both session and node events
  *   When processing the Expire event:
     *   If the node does not exist then recreate the node (current behavior)
     *   If the node exists do nothing (no looping)
  *   When processing a delete node event:
     *   If the node does not exist then recreate the node (new behavior)
     *   If the node exists do nothing
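
The handler logic in the list above could be sketched roughly as follows. This is a simplified in-memory model in Python, not the ZkClient API: `FakeZk`, `Registration`, and their methods are hypothetical names, and the "slow delete" is simulated by removing the stale node explicitly after the Expire events have been handled.

```python
class FakeZk:
    """In-memory stand-in for ZooKeeper with node-deleted notifications."""

    def __init__(self):
        self.nodes = {}              # path -> owning session id
        self.session_id = 1
        self.delete_listeners = []   # callbacks invoked with the deleted path

    def exists(self, path):
        return path in self.nodes

    def create_ephemeral(self, path):
        self.nodes[path] = self.session_id

    def new_session(self):
        # The client reconnects with a new session; stale nodes may linger.
        self.session_id += 1

    def delete_stale(self, path, old_session):
        # Simulates the late removal of a node owned by an expired session.
        if self.nodes.get(path) == old_session:
            del self.nodes[path]
            for listener in list(self.delete_listeners):
                listener(path)


class Registration:
    """Keeps one ephemeral node registered without any waiting loop."""

    def __init__(self, zk, path):
        self.zk = zk
        self.path = path
        zk.delete_listeners.append(self.on_node_deleted)
        self.ensure_node()

    def ensure_node(self):
        # Create the node only if it is missing; report whether we created it.
        if not self.zk.exists(self.path):
            self.zk.create_ephemeral(self.path)
            return True
        return False

    def on_session_expired(self):
        return self.ensure_node()

    def on_node_deleted(self, path):
        if path == self.path:
            self.ensure_node()


zk = FakeZk()
path = "/brokers/ids/0"  # illustrative path
reg = Registration(zk, path)      # node owned by session 1

zk.new_session()                  # session 2; the old node still exists
first = reg.on_session_expired()  # node exists, so nothing happens
second = reg.on_session_expired() # a queued second Expire is also harmless

zk.delete_stale(path, old_session=1)  # late delete triggers recreation
print(first, second, zk.nodes[path])
```

In this model both Expire handlers return immediately, and the node is recreated under the new session only once the delete notification arrives. A real client would also have to cope with the check and the create not being atomic, which this sketch ignores.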

The basic assumption is that:

    "In the rare case where the node still exists from the previous session when the Expire
message is processed then we can be confident that we will be notified later when the node
is finally deleted."

In my testing I have seen:

  *   If I recreate the node while handling the Expire I do not later get a delete message
(for the already deleted node).
  *   If I do nothing when I process the Expire (to partially simulate a slow ZooKeeper delete)
then I do get a delete message for the old node (which was actually deleted before I processed
the Expire message).

I would greatly appreciate your insights on this issue.  For more details you can see the
Kafka issue.

James Lent
Senior Software Engineer

A TiVo Company

jlent@digitalsmiths.com | office 919.460.4747

