Subject: Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack
To: dev@couchdb.apache.org
From: Joan Touzet
Organization: Apache Software Foundation
Date: Fri, 26 Apr 2019 16:19:44 -0400

Hi Adam,

I'll bring up a concern from a recent client with whom I engaged. They're on 1.x.
On 1.x they have been doing 50k bulk update operations in a single request. 1.x doesn't time out. The updates are constructed so that none will result in a conflict or be rejected, so all 50k are accepted. They do this so the update appears atomic to the next reader - a read from another client can't occur in the middle of the big update, because we have a single couch_file in 1.x.

Obviously, in 2.x this doesn't work, on two levels. First, there are multiple readers and writers across a cluster, so the big bulk operation doesn't block interposed reads until it's finished. Second, you can't reliably finish 50k updates in a single batch in a cluster anyway, because you'll probably hit the fabric timeout, if not other cluster timeouts.

As a general rule of thumb, I advise people to keep bulk document updates to batches of no more than 1k at a time, with the understanding that in 2.x these are not treated as an atomic transaction (and they weren't strictly that way in 1.x, either, but never mind that...)

If we decide as a project that all operations must take less than 5 seconds, we're probably going to have to reduce the bulk update batch size even further. I'm betting 100 would be the upper bound on bulk updates.

Is this going to impose a significant performance penalty on bulk ops?

-Joan

On 2019-04-26 3:30 p.m., Adam Kocoloski wrote:
> Hi all,
>
> The point I'm making is that we should take advantage of this extra bit of information that we acquire out-of-band (e.g. we just decide as a project that all operations take less than 5 seconds) and come up with smarter / cheaper / faster ways of doing load shedding based on that information.
>
> For example, yes, it could be interesting to use is_process_alive/1 to see if a client is still hanging around, and have the gen_server discard the work otherwise. It might also be too expensive to matter; I'm not sure anyone here has a good a priori sense of the cost of that call. But I'd certainly wager it's more expensive than calling timer:now_diff/2 in the server and discarding any requests that were submitted more than 5 seconds ago.
>
> Most of our timeout / cleanup solutions to date have been focused top-down, without making any assumptions about the behavior of the workers or servers underneath. I think we should try to approach this problem bottom-up, forcing every call to complete within 5 seconds and handling timeouts correctly as they bubble up.
>
> Adam
>
>> On Apr 23, 2019, at 2:48 PM, Nick Vatamaniuc wrote:
>>
>> We don't spawn (/link) or monitor remote processes, just monitor the local coordinator process. That should be cheaper performance-wise. It's also for relatively long-running streaming fabric requests (changes, all_docs). But you're right, perhaps doing this for shorter requests (doc updates, doc GETs) might become noticeable. Perhaps a pool of reusable monitoring processes would work there...
>>
>> For couch_server timeouts, I wonder if we can do a simpler thing and inspect the `From` part of each call, and if the Pid is not alive, drop the request to at least avoid doing any expensive processing. For casts it might involve sending a sender Pid in the message. That doesn't address timeouts, just the case where the coordinating process went away while the message was stuck in the long message queue.
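
(For illustration only - not actual CouchDB code - here is a minimal sketch of the kind of shedding Adam and Nick describe above: the caller stamps the request with os:timestamp(), and the gen_server drops the call if the deadline has passed or the local caller Pid is no longer alive. The message shape and the names max_request_age/0 and do_open/2 are hypothetical.)

    handle_call({open, DbName, T0}, {FromPid, _Tag}, State) ->
        %% Age of the request in microseconds, per Adam's timer:now_diff/2 idea.
        TooOld = timer:now_diff(os:timestamp(), T0) > max_request_age(),
        %% is_process_alive/1 only works for local pids, so guard on node/1.
        CallerDead = node(FromPid) =:= node()
            andalso not is_process_alive(FromPid),
        case TooOld orelse CallerDead of
            true ->
                %% The client has timed out (or died); shed the request rather
                %% than doing expensive work and replying into the void.
                {noreply, State};
            false ->
                {reply, do_open(DbName, State), State}
        end.

    max_request_age() ->
        5 * 1000 * 1000.  %% 5 seconds, in microseconds

(Returning {noreply, State} without ever replying is fine here, because the branch is only taken when the caller has already timed out or gone away.)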
>>
>> On Mon, Apr 22, 2019 at 4:32 PM Robert Newson wrote:
>>
>>> My memory is fuzzy, but those items sound a lot like what happens with rex, which motivated us (i.e., Adam) to build rexi, which deliberately does less than the stock approach.
>>>
>>> --
>>> Robert Samuel Newson
>>> rnewson@apache.org
>>>
>>> On Mon, 22 Apr 2019, at 18:33, Nick Vatamaniuc wrote:
>>>> Hi everyone,
>>>>
>>>> We partially implemented the first part (cleaning up rexi workers) for all the fabric streaming requests, which should be all_docs, changes, view map, and view reduce:
>>>> https://github.com/apache/couchdb/commit/632f303a47bd89a97c831fd0532cb7541b80355d
>>>>
>>>> The pattern there is the following [a rough sketch is appended at the end of this message]:
>>>>
>>>> - With every request, spawn a monitoring process that is in charge of keeping track of all the workers as they are spawned.
>>>> - If regular cleanup takes place, then this monitoring process is killed, to avoid sending double the number of kill messages to workers.
>>>> - If the coordinating process doesn't run cleanup and just dies, the monitoring process performs cleanup on its behalf.
>>>>
>>>> Cheers,
>>>> -Nick
>>>>
>>>> On Thu, Apr 18, 2019 at 5:16 PM Robert Samuel Newson wrote:
>>>>
>>>>> My view is a) the server was unavailable for this request due to all the other requests it's currently dealing with, and b) the connection was not idle, so the client is not at fault.
>>>>>
>>>>> B.
>>>>>
>>>>>> On 18 Apr 2019, at 22:03, Done Collectively wrote:
>>>>>>
>>>>>> Any reason 408 would be undesirable?
>>>>>>
>>>>>> https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/408
>>>>>>
>>>>>> On Thu, Apr 18, 2019 at 10:37 AM Robert Newson wrote:
>>>>>>
>>>>>>> 503 imo.
>>>>>>>
>>>>>>> --
>>>>>>> Robert Samuel Newson
>>>>>>> rnewson@apache.org
>>>>>>>
>>>>>>> On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
>>>>>>>> Yes, we should. Currently it's a 500; maybe there's something more appropriate:
>>>>>>>>
>>>>>>>> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
>>>>>>>>
>>>>>>>> Adam
>>>>>>>>
>>>>>>>>> On Apr 18, 2019, at 12:50 PM, Joan Touzet wrote:
>>>>>>>>>
>>>>>>>>> What happens when it turns out the client *hasn't* timed out and we just...hang up on them? Should we consider at least trying to send back some sort of HTTP status code?
>>>>>>>>>
>>>>>>>>> -Joan
>>>>>>>>>
>>>>>>>>> On 2019-04-18 10:58, Garren Smith wrote:
>>>>>>>>>> I'm +1 on this. With partition queries, we added a few more timeouts that can be enabled, which Cloudant enables. So having the ability to shed old requests when these timeouts get hit would be great.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Garren
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocolosk@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> For once, I'm coming to you with a topic that is not strictly about FoundationDB :)
>>>>>>>>>>>
>>>>>>>>>>> CouchDB offers a few config settings (some of them undocumented) to put a limit on how long the server is allowed to take to generate a response.
>>>>>>>>>>> The trouble with many of these timeouts is that, when they fire, they do not actually clean up all of the work that they initiated. A couple of examples:
>>>>>>>>>>>
>>>>>>>>>>> - Each HTTP response coordinated by the "fabric" application spawns several ephemeral processes via "rexi" on different nodes in the cluster to retrieve data and send it back to the process coordinating the response. If the request timeout fires, the coordinating process will be killed off, but the ephemeral workers might not be. In a healthy cluster they'll exit on their own when they finish their jobs, but there are conditions under which they can sit around for extended periods of time waiting for an overloaded gen_server (e.g. couch_server) to respond.
>>>>>>>>>>>
>>>>>>>>>>> - Those named gen_servers (like couch_server) responsible for serializing access to important data structures will dutifully process messages received from old requests without any regard for (or even knowledge of) the fact that the client that sent the message timed out long ago. This can lead to a sort of death spiral in which the gen_server is ultimately spending ~all of its time serving dead clients and every client is timing out.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to see us introduce a documented maximum request duration for all requests except the _changes feed, and then use that information to aid in load shedding throughout the stack. We can audit the codebase for gen_server calls with long timeouts (I know of a few on the critical path that set their timeouts to `infinity`) and we can design servers that efficiently drop old requests, knowing that the client who made the request must have timed out. A couple of topics for discussion:
>>>>>>>>>>>
>>>>>>>>>>> - the "gen_server that sheds old requests" is a very generic pattern, one that seems like it could be well-suited to its own behaviour. A cursory search of the internet didn't turn up any prior art here, which surprises me a bit. I'm wondering if this is worth bringing up with the broader Erlang community.
>>>>>>>>>>>
>>>>>>>>>>> - setting and enforcing timeouts is a healthy pattern for read-only requests, as it gives a lot more feedback to clients about the health of the server. When it comes to updates, things are a little bit more muddy, just because there remains a chance that an update can be committed but the caller times out before learning of the successful commit. We should try to minimize the likelihood of that occurring.
>>>>>>>>>>>
>>>>>>>>>>> Cheers, Adam
>>>>>>>>>>>
>>>>>>>>>>> P.S. I did say that this wasn't _strictly_ about FoundationDB, but of course FDB has a hard 5 second limit on all transactions, so it is a bit of a forcing function :). Even putting FoundationDB aside, I would still argue for pursuing this path based on our Ops experience with the current codebase.
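
(Appendix, for illustration: a rough sketch - not the code in the commit Nick links above - of the cleanup-monitor pattern he describes. A helper process monitors the coordinator and collects the rexi worker pids; on normal cleanup the coordinator stops the helper first, and if the coordinator dies without cleaning up, the helper kills the workers on its behalf. The function and message names here are hypothetical.)

    spawn_cleanup_monitor(Coordinator) ->
        spawn(fun() ->
            Ref = erlang:monitor(process, Coordinator),
            cleanup_loop(Ref, [])
        end).

    cleanup_loop(Ref, Workers) ->
        receive
            {add_worker, WorkerPid} ->
                %% The coordinator reports each rexi worker as it spawns it.
                cleanup_loop(Ref, [WorkerPid | Workers]);
            stop ->
                %% Regular cleanup already ran; exit quietly so the workers
                %% don't receive a double dose of kill messages.
                ok;
            {'DOWN', Ref, process, _Coordinator, _Reason} ->
                %% The coordinator died without cleaning up; kill the
                %% remaining workers on its behalf.
                [exit(W, kill) || W <- Workers],
                ok
        end.

(Usage would be along the lines of: the coordinator calls Monitor = spawn_cleanup_monitor(self()) up front, sends {add_worker, Pid} as it spawns workers, and sends stop - or simply kills the monitor, as the linked commit does - once its own cleanup has run.)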