Date: Tue, 19 Feb 2013 21:07:14 +0000 (UTC)
From: "Bob Cotton (JIRA)"
To: dev@kafka.apache.org
Subject: [jira] [Created] (KAFKA-764) Race Condition in Broker Registration after ZooKeeper disconnect

Bob Cotton created KAFKA-764:
--------------------------------

             Summary: Race Condition in Broker Registration after ZooKeeper disconnect
                 Key: KAFKA-764
                 URL: https://issues.apache.org/jira/browse/KAFKA-764
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.7.1
            Reporter: Bob Cotton

When running our ZooKeepers in VMware, occasionally all the keepers simultaneously pause long enough for the Kafka clients to time out, and then the keepers simultaneously un-pause.

When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper comes back, ZkUtils.createEphemeralPathExpectConflict finds the existing broker id node for itself, does not re-register that node, and the function call succeeds.
Then ZooKeeper figures out the broker disconnected from the keeper and deletes the ephemeral node *after* allowing the consumer to read the data in the /brokers/ids/x node.

The broker then goes on to register all the topics, etc. When consumers connect, they see topic nodes associated with the broker, but they can't find the broker node to get connection information for the broker. This sends them into a rebalance loop until they reach rebalance.retries.max and fail.

This might also be a ZooKeeper issue, but the desired behavior for a disconnect case might be, if the broker node is found, to explicitly delete and recreate it.
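
To make the ordering of events concrete, here is a minimal sketch of the race and of the delete-and-recreate idea suggested above. It is a toy in-memory model in Python, not Kafka or ZooKeeper code: FakeZk, the session names, the "host:9092" payload and both register_* functions are hypothetical stand-ins, and the only behavior assumed from ZooKeeper is that an ephemeral node is removed when the session that created it expires.

```python
class NodeExists(Exception):
    pass


class FakeZk:
    """Toy in-memory stand-in for ZooKeeper ephemeral nodes (not a real client)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, id of the session that created it)

    def create_ephemeral(self, session, path, data):
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = (data, session)

    def read(self, path):
        return self.nodes[path][0]

    def delete(self, path):
        self.nodes.pop(path, None)

    def expire_session(self, session):
        # The server drops every ephemeral node owned by the expired session.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != session}


def register_expect_conflict(zk, session, path, data):
    """Behavior described in the report: an existing node with our data counts as success."""
    try:
        zk.create_ephemeral(session, path, data)
    except NodeExists:
        if zk.read(path) != data:
            raise  # a different broker holds this id
        # The node carries our own data, so nothing is re-registered.


def register_delete_and_recreate(zk, session, path, data):
    """Suggested alternative: replace the stale node so the new session owns it."""
    try:
        zk.create_ephemeral(session, path, data)
    except NodeExists:
        if zk.read(path) != data:
            raise  # a different broker holds this id
        zk.delete(path)
        zk.create_ephemeral(session, path, data)


def simulate(register):
    """Return True if the broker node survives the late expiry of the old session."""
    zk = FakeZk()
    path, data = "/brokers/ids/1", "host:9092"
    zk.create_ephemeral("old-session", path, data)  # initial registration
    register(zk, "new-session", path, data)         # broker reconnects and re-registers
    zk.expire_session("old-session")                # server expires the old session afterwards
    return path in zk.nodes
```

Calling simulate(register_expect_conflict) leaves no broker node behind, which matches the state the consumers run into, while simulate(register_delete_and_recreate) keeps the node because it now belongs to the new session. A real fix would also have to handle the window between the delete and the recreate, which this sketch ignores.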