cassandra-commits mailing list archives

From "Sergio Bossa (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
Date Thu, 14 Jul 2016 09:09:20 GMT


Sergio Bossa commented on CASSANDRA-9318:


Yes, I think we're going in circles, and probably one of us keeps missing the other's point.

You said:

bq. Pick a number for how much memory we can afford to have taken up by in flight requests
(remembering that we need to keep the entirely payload around for potential hint writing)
as a fraction of the heap, the way we do with memtables or key cache. If we hit that mark
we start throttling new requests and only accept them as old ones drain off.

I'll make one last attempt at explaining why, _in my opinion_, this is worse _on all accounts_,
via an example.

Say you have:
* U: the memory unit to express back-pressure.
* T: the write RPC timeout.
* 3 nodes cluster with RF=2.

Say the coordinator memory threshold is 10U, and clients hit it with requests worth 20U;
for simplicity, the coordinator is not a replica, so with RF=2 it has to accommodate 40U
of in-flight requests (each request is held for both replicas). At some point P < T, the first
replica answers all of them, while the second replica answers only half, leaving 10U still in
flight, which means the memory threshold is met; those 10U will be drained when either the slow
replica answers (let's call this time interval R) or T elapses. This means:
1) During a time period equal to min(R, T) you'll have 0 throughput. This is made worse if
you actually have to wait for T to elapse, as the stall becomes proportional to T.
2) If replica 2 keeps exhibiting the same slow behaviour, you'll keep seeing a throughput
profile of high peaks and 0 valleys; this is bad not just because of the 0 valleys, but also
because during the high peaks the slow replica will end up dropping 10U worth of mutations.

Both #1 and #2 look like pretty bad outcomes to me, and definitely no better than throttling
at the speed of the slowest replica.
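To make the stop-and-go behaviour concrete, here's a minimal discrete-time sketch (hypothetical parameters, not Cassandra code) of a high/low watermark scheme: the coordinator stops reading from clients once 10U is in flight and only resumes after the slow replica drains it below a low watermark, producing exactly the peaks-and-zero-valleys profile described above:

```python
def watermark_throttling(high=10, low=2, drain_rate=3, arrival=10, ticks=10):
    """Units admitted per tick; 0 means the coordinator stopped reading.

    All numbers are illustrative: `drain_rate` stands for the slow
    replica's ack rate, `arrival` for offered client load per tick.
    """
    in_flight = 0
    reading = True
    admitted = []
    for _ in range(ticks):
        in_flight = max(0, in_flight - drain_rate)  # slow replica acks some units
        if not reading and in_flight <= low:
            reading = True                          # low watermark: resume reads
        accepted = min(arrival, high - in_flight) if reading else 0
        in_flight += accepted
        if in_flight >= high:
            reading = False                         # high watermark: stall clients
        admitted.append(accepted)
    return admitted

print(watermark_throttling())
```

With these numbers the per-tick throughput alternates between bursts near 10U and runs of 0, which is the "high peaks and 0 valleys" profile the example argues against.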

Speaking of which, how would the other algorithm work in such a case? Given the same scenario
as above, we'd have the following:
1) During the first T period (T1), there will be no back-pressure (because the algorithm
works at time windows of size T), so nothing changes from the current behaviour.
2) At T2, it will start throttling at the slow replica rate of 10U (obviously the actual throttling
is not memory based in this case, but I'll keep the same notation for simplicity): this means
the sustained throughput during T2 will be 10U.
3) For all Tn > T2, throughput will *gradually* increase or decrease as the slow replica
speeds up or slows down, but without ever touching 0, which means _no high peaks causing a high
dropped-mutations rate_ and _no 0 valleys of length T_.

Hope this clarifies my reasoning, and please let me know if/how my example above is flawed
(that is, if I keep missing your point).

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>                 Key: CASSANDRA-9318
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm,
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding
the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests
and if it reaches a high watermark disable read on client connections until it goes back below
some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other

This message was sent by Atlassian JIRA
