From: Adam Kocoloski
Subject: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack
Message-Id: <81B09DB8-D192-4319-A3FA-4C123C206F33@apache.org>
Date: Mon, 15 Apr 2019 20:40:57 -0400
To: CouchDB Developers

Hi all,

For once, I'm coming to you with a topic that is not strictly about FoundationDB :)

CouchDB offers a few config settings (some of them undocumented) to put a limit on how long the server is allowed to take to generate a response. The trouble with many of these timeouts is that, when they fire, they do not actually clean up all of the work that they initiated. A couple of examples:

- Each HTTP response coordinated by the "fabric" application spawns several ephemeral processes via "rexi" on different nodes in the cluster to retrieve data and send it back to the process coordinating the response. If the request timeout fires, the coordinating process will be killed off, but the ephemeral workers might not be. In a healthy cluster they'll exit on their own when they finish their jobs, but there are conditions under which they can sit around for extended periods of time waiting for an overloaded gen_server (e.g. couch_server) to respond.

- Those named gen_servers (like couch_server) responsible for serializing access to important data structures will dutifully process messages received from old requests without any regard for (or even knowledge of) the fact that the client that sent the message timed out long ago. This can lead to a sort of death spiral in which the gen_server is ultimately spending ~all of its time serving dead clients and every client is timing out.

I'd like to see us introduce a documented maximum request duration for all requests except the _changes feed, and then use that information to aid in load shedding throughout the stack. We can audit the codebase for gen_server calls with long timeouts (I know of a few on the critical path that set their timeouts to `infinity`) and we can design servers that efficiently drop old requests, knowing that the client who made the request must have timed out.
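To make the "drop old requests" idea concrete, here is a minimal sketch of a gen_server that sheds expired work. All of the names here (shedding_server, call/3, the {Deadline, Request} tuple) are hypothetical illustrations of the pattern, not existing CouchDB or OTP APIs; the key moves are that the caller always uses a bounded timeout (never `infinity`) and ships an absolute deadline with the request, and the server checks that deadline before doing any work:

```erlang
%% Hypothetical sketch of a request-shedding gen_server.
-module(shedding_server).
-behaviour(gen_server).

-export([start_link/0, call/3]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% The caller derives an absolute deadline from its own bounded
%% timeout and sends it along with the request.
call(Server, Request, TimeoutMs) ->
    Deadline = erlang:monotonic_time(millisecond) + TimeoutMs,
    gen_server:call(Server, {Deadline, Request}, TimeoutMs).

init([]) ->
    {ok, #{}}.

%% If the deadline has passed by the time the message is dequeued,
%% the caller's gen_server:call has already timed out, so the server
%% does no work and sends no reply rather than serving a dead client.
handle_call({Deadline, Request}, _From, State) ->
    case erlang:monotonic_time(millisecond) >= Deadline of
        true  -> {noreply, State};
        false -> {reply, do_work(Request), State}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.

do_work(Request) ->
    {ok, Request}.  %% placeholder for the real serialized work
```

Returning `{noreply, State}` without ever replying is safe here because the caller has, by construction, already received a timeout exit. A reusable behaviour could factor this deadline check out of the callback module entirely, so that user code never even sees expired requests.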
A couple of topics for discussion:

- the "gen_server that sheds old requests" is a very generic pattern, one that seems like it could be well-suited to its own behaviour. A cursory search of the internet didn't turn up any prior art here, which surprises me a bit. I'm wondering if this is worth bringing up with the broader Erlang community.

- setting and enforcing timeouts is a healthy pattern for read-only requests, as it gives a lot more feedback to clients about the health of the server. When it comes to updates things are a little bit more muddy, just because there remains a chance that an update can be committed, but the caller times out before learning of the successful commit. We should try to minimize the likelihood of that occurring.

Cheers, Adam

P.S. I did say that this wasn't _strictly_ about FoundationDB, but of course FDB has a hard 5 second limit on all transactions, so it is a bit of a forcing function :). Even putting FoundationDB aside, I would still argue to pursue this path based on our Ops experience with the current codebase.