From dev-return-82227-apmail-zookeeper-dev-archive=zookeeper.apache.org@zookeeper.apache.org Fri Aug 2 18:03:50 2019
Mailing-List: contact dev-help@zookeeper.apache.org; run by ezmlm
Reply-To: dev@zookeeper.apache.org
MIME-Version: 1.0
From: Michael Han
Date: Fri, 2 Aug 2019 11:03:35 -0700
Subject: KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
To: dev@zookeeper.apache.org
Content-Type: 
multipart/alternative; boundary="000000000000fabd52058f262d4b"

--000000000000fabd52058f262d4b
Content-Type: text/plain; charset="UTF-8"

Folks,

Some of you might have already seen this. Comments?

https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum

What caught my eye is:

*Worse still, although ZooKeeper is the store of record, the state in ZooKeeper often doesn't match the state that is held in memory in the controller. For example, when a partition leader changes its ISR in ZK, the controller will typically not learn about these changes for many seconds. There is no generic way for the controller to follow the ZooKeeper event log. Although the controller can set one-shot watches, the number of watches is limited for performance reasons. When a watch triggers, it doesn't tell the controller the current state-- only that the state has changed. By the time the controller re-reads the znode and sets up a new watch, the state may have changed from what it was when the watch originally fired. If there is no watch set, the controller may not learn about the change at all. In some cases, restarting the controller is the only way to resolve the discrepancy.*

I've seen similar ZooKeeper use cases that ended up in the situation described here. How can ZooKeeper solve this? It seems to me that the only solution is to provide linearizable reads on watched operations.

Thoughts?

Michael.

--000000000000fabd52058f262d4b--
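
To make the race in the quoted paragraph concrete, here is a minimal sketch using a toy in-memory model of a znode with one-shot watches. It is not the ZooKeeper client API: `ToyZNode`, `run_demo`, and the `isr=[...]` strings are hypothetical names invented for illustration, and the only assumption taken from the quoted text is that a watch fires once, carries no data, and must be re-armed by a fresh read.

```python
class ToyZNode:
    """In-memory stand-in for a znode with one-shot watches (hypothetical, not ZooKeeper's API)."""

    def __init__(self, data):
        self.data = data
        self.version = 0
        self._watchers = []

    def get(self, watch=None):
        # Read the current state and optionally register a one-shot watch.
        if watch is not None:
            self._watchers.append(watch)
        return self.data, self.version

    def set(self, data):
        self.data = data
        self.version += 1
        # One-shot semantics: clear the watch list before firing,
        # so each registered watch fires at most once.
        fired, self._watchers = self._watchers, []
        for callback in fired:
            callback()  # the notification carries no data, only "something changed"


def run_demo():
    node = ToyZNode("isr=[1,2,3]")
    observed = []  # (data, version) pairs the controller actually read
    pending = []   # notifications received but not yet processed

    def on_change():
        pending.append("changed")

    # Controller: initial read, arming the watch.
    observed.append(node.get(watch=on_change))

    # The partition leader updates the ISR three times before the controller reacts.
    node.set("isr=[1,2]")  # fires the one-shot watch
    node.set("isr=[1]")    # no watch is armed: silent
    node.set("isr=[1,3]")  # no watch is armed: silent

    # Controller finally handles the notification: re-read and re-arm.
    while pending:
        pending.pop()
        observed.append(node.get(watch=on_change))

    return observed, node.version


if __name__ == "__main__":
    observed, version = run_demo()
    print("controller observed:", observed)
    print("latest version:", version)
```

In this model the controller reads only versions 0 and 3. The intermediate writes (versions 1 and 2) happen while no watch is armed, so they are never observed, which is the gap the KIP text describes. A version counter lets the reader detect that writes were skipped, but not recover what they were.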