From: Mike Solomon
Date: Wed, 12 Oct 2016 12:57:54 -0700
Subject: outstandingChanges queue grows without bound
To: user@zookeeper.apache.org

I've been performance testing 3.5.2 and hit an interesting unavailability issue.

When the server is very busy (64k connections, 16k writes per second), the leader can get busy enough that connections get throttled. Enough throttling causes sessions to expire. As sessions expire, CPU consumption rises and the quorum is effectively unavailable. Interestingly, if you shut down all the clients, the quorum won't heal for nearly 10 minutes.

The issue is that the outstandingChanges queue has 250k items in it and the closeSession code scans this linearly under a lock. Replacing the linear scan with a hash table lookup improves this, but the real solution is likely some backpressure on clients once the outstandingChanges queue becomes oversized.

Here is a sample fix:
https://github.com/msolo/zookeeper/commit/75da352d506c2e3b0001d28acc058c422b3c8f0c

This results in the quorum healing about 30 seconds after the clients disconnect.

Is there a way to prevent runaway growth in this queue? I'm wondering if changing the definition of "throttling" to take into account the size of this queue might help mitigate this. The end goal is that some stable amount of traffic is reached asymptotically without suffering a collapse.

Thanks,
-Mike
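
To make the two ideas above concrete, here is a minimal, self-contained sketch of a pending-change queue that keeps a per-session hash index (so closing a session does not need to walk the whole queue) and exposes a size check that a throttling decision could consult. Everything in it is an assumption for illustration: PendingChanges, Change, softLimit and shouldThrottle are hypothetical names, not ZooKeeper classes or settings, and this is not necessarily how the linked commit does it.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a pending-change queue with a per-session index. */
final class PendingChanges {
    /** Hypothetical stand-in for a queued change record. */
    static final class Change {
        final String path;
        final long ephemeralOwner; // 0 means the node is not ephemeral

        Change(String path, long ephemeralOwner) {
            this.path = path;
            this.ephemeralOwner = ephemeralOwner;
        }
    }

    private final ArrayDeque<Change> queue = new ArrayDeque<>();
    // session id -> (path -> number of queued changes for that path)
    private final Map<Long, Map<String, Integer>> bySession = new HashMap<>();
    private final int softLimit; // assumed tuning knob, not a real ZooKeeper setting

    PendingChanges(int softLimit) {
        this.softLimit = softLimit;
    }

    /** Appends a change and updates the index in O(1) expected time. */
    synchronized void add(Change c) {
        queue.addLast(c);
        if (c.ephemeralOwner != 0) {
            bySession.computeIfAbsent(c.ephemeralOwner, k -> new HashMap<>())
                     .merge(c.path, 1, Integer::sum);
        }
    }

    /** Removes the oldest change (or returns null) and keeps the index in sync. */
    synchronized Change poll() {
        Change c = queue.pollFirst();
        if (c != null && c.ephemeralOwner != 0) {
            Map<String, Integer> paths = bySession.get(c.ephemeralOwner);
            if (paths != null) {
                // Decrement the count; returning null drops the entry.
                paths.computeIfPresent(c.path, (p, n) -> n > 1 ? n - 1 : null);
                if (paths.isEmpty()) {
                    bySession.remove(c.ephemeralOwner);
                }
            }
        }
        return c;
    }

    /** Pending ephemeral paths for one session; cost does not depend on queue length. */
    synchronized Set<String> pendingEphemerals(long sessionId) {
        Map<String, Integer> paths = bySession.get(sessionId);
        return paths == null ? Collections.emptySet() : new HashSet<>(paths.keySet());
    }

    /** Backpressure signal: true once the queue has reached the soft limit. */
    synchronized boolean shouldThrottle() {
        return queue.size() >= softLimit;
    }
}
```

The index trades a little bookkeeping on every add and poll for a session lookup whose cost depends only on that session's own pending paths, which matters when the queue holds hundreds of thousands of entries. The shouldThrottle check only reports that the queue is oversized; how the server would act on it (for example, pausing reads from client connections) is left open here.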