couchdb-dev mailing list archives

From "Robert Newson" <rnew...@apache.org>
Subject Re: [DISCUSS] Improve load shedding by enforcing timeouts throughout stack
Date Thu, 18 Apr 2019 17:37:13 GMT
503 imo.

-- 
  Robert Samuel Newson
  rnewson@apache.org

On Thu, 18 Apr 2019, at 18:24, Adam Kocoloski wrote:
> Yes, we should. Currently it’s a 500, maybe there’s something more appropriate:
> 
> https://github.com/apache/couchdb/blob/8ef42f7241f8788afc1b6e7255ce78ce5d5ea5c3/src/chttpd/src/chttpd.erl#L947-L949
> 
> Adam
> 
> > On Apr 18, 2019, at 12:50 PM, Joan Touzet <wohali@apache.org> wrote:
> > 
> > What happens when it turns out the client *hasn't* timed out and we
> > just...hang up on them? Should we consider at least trying to send back
> > some sort of HTTP status code?
> > 
> > -Joan
> > 
> > On 2019-04-18 10:58, Garren Smith wrote:
> >> I'm +1 on this. With partition queries, we added a few more optional
> >> timeouts, which Cloudant enables. So having the ability to shed old
> >> requests when these timeouts get hit would be great.
> >> 
> >> Cheers
> >> Garren
> >> 
> >> On Tue, Apr 16, 2019 at 2:41 AM Adam Kocoloski <kocolosk@apache.org> wrote:
> >> 
> >>> Hi all,
> >>> 
> >>> For once, I’m coming to you with a topic that is not strictly about
> >>> FoundationDB :)
> >>> 
> >>> CouchDB offers a few config settings (some of them undocumented) to put a
> >>> limit on how long the server is allowed to take to generate a response. The
> >>> trouble with many of these timeouts is that, when they fire, they do not
> >>> actually clean up all of the work that they initiated. A couple of examples:
> >>> 
> >>> - Each HTTP response coordinated by the “fabric” application spawns
> >>> several ephemeral processes via “rexi” on different nodes in the cluster to
> >>> retrieve data and send it back to the process coordinating the response. If
> >>> the request timeout fires, the coordinating process will be killed off, but
> >>> the ephemeral workers might not be. In a healthy cluster they’ll exit on
> >>> their own when they finish their jobs, but there are conditions under which
> >>> they can sit around for extended periods of time waiting for an overloaded
> >>> gen_server (e.g. couch_server) to respond.
> >>> 
> >>> - Those named gen_servers (like couch_server) responsible for serializing
> >>> access to important data structures will dutifully process messages
> >>> received from old requests without any regard for (or even knowledge of)
> >>> the fact that the client that sent the message timed out long ago. This can
> >>> lead to a sort of death spiral in which the gen_server is ultimately
> >>> spending ~all of its time serving dead clients and every client is timing
> >>> out.
> >>> 
> >>> I’d like to see us introduce a documented maximum request duration for all
> >>> requests except the _changes feed, and then use that information to aid in
> >>> load shedding throughout the stack. We can audit the codebase for
> >>> gen_server calls with long timeouts (I know of a few on the critical path
> >>> that set their timeouts to `infinity`) and we can design servers that
> >>> efficiently drop old requests, knowing that the client who made the request
> >>> must have timed out. A couple of topics for discussion:
> >>> 
> >>> - the “gen_server that sheds old requests” is a very generic pattern, one
> >>> that seems like it could be well-suited to its own behaviour. A cursory
> >>> search of the internet didn’t turn up any prior art here, which surprises
> >>> me a bit. I’m wondering if this is worth bringing up with the broader
> >>> Erlang community.
> >>> 
> >>> - setting and enforcing timeouts is a healthy pattern for read-only
> >>> requests as it gives a lot more feedback to clients about the health of the
> >>> server. When it comes to updates things are a little bit more muddy, just
> >>> because there remains a chance that an update can be committed, but the
> >>> caller times out before learning of the successful commit. We should try to
> >>> minimize the likelihood of that occurring.
> >>> 
> >>> Cheers, Adam
> >>> 
> >>> P.S. I did say that this wasn’t _strictly_ about FoundationDB, but of
> >>> course FDB has a hard 5 second limit on all transactions, so it is a bit of
> >>> a forcing function :). Even putting FoundationDB aside, I would still argue
> >>> to pursue this path based on our Ops experience with the current codebase.
> >> 
> > 
> 
>
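
The "gen_server that sheds old requests" idea from the quoted proposal can be
sketched outside Erlang. What follows is a minimal Python illustration, not
CouchDB code and not an existing behaviour: `Request`, `SheddingServer`, and the
injectable `clock` parameter are hypothetical names. It assumes each request is
stamped with an absolute deadline when it is enqueued, derived from the caller's
timeout, so the server can skip work whose caller must already have given up.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Deque, List, Tuple


@dataclass
class Request:
    payload: Any
    deadline: float  # absolute time, on the server's clock, when the caller gives up


class SheddingServer:
    """Serial request processor that drops requests whose deadline has passed.

    Hypothetical sketch: a real gen_server would receive messages from a
    mailbox; here a deque stands in for the mailbox.
    """

    def __init__(self, handler: Callable[[Any], Any],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._handler = handler
        self._clock = clock
        self._mailbox: Deque[Request] = deque()
        self.shed_count = 0  # number of requests dropped without doing any work

    def call(self, payload: Any, timeout: float) -> None:
        """Enqueue a request, stamping it with an absolute deadline."""
        self._mailbox.append(Request(payload, self._clock() + timeout))

    def drain(self) -> List[Tuple[Any, Any]]:
        """Process queued requests in order, skipping the expired ones."""
        results: List[Tuple[Any, Any]] = []
        while self._mailbox:
            req = self._mailbox.popleft()
            if self._clock() >= req.deadline:
                # The caller has already timed out, so serving it is wasted work.
                self.shed_count += 1
                continue
            results.append((req.payload, self._handler(req.payload)))
        return results
```

Checking the deadline at dequeue time rather than at enqueue time is the point
of the sketch: a request that was fresh when it arrived may be stale by the time
an overloaded server reaches it. The sketch only covers the read-style case; as
the proposal notes, updates are harder because a write can commit even though
the caller has already timed out.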
