cassandra-commits mailing list archives

From "Sergio Bossa (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
Date Wed, 14 Sep 2016 10:46:20 GMT


Sergio Bossa commented on CASSANDRA-9318:

bq. It's possible I'm totally missing a point you and Stefania are trying to make, but it
seems to me to be the only reasonable way. The timeout is a deadline the user asks us to respect,
it's the whole point of it, so it should always be respected as strictly as possible.

I didn't follow the two issues mentioned above: if that's the end goal, I agree we should
be strict with it.

bq. that discussion seems to suggest that back-pressure would make it harder for C* to respect
a reasonable timeout. I'll admit that sounds counter-intuitive to me as a functioning back-pressure
should make it easier by smoothing things over when there is too much pressure.

The load is smoothed on the server side; beyond that, it depends on how many replicas are "in trouble"
and how aggressive clients are. As an example, say the timeout is 2s, the incoming request
rate at the coordinator is 1000/s, and the processing rate is 50/s at replica A and 1000/s at
replica B; then, with CL.ONE (assuming the coordinator is not part of the replica group):
1) If back-pressure is disabled, we get no client timeouts, but ~900 mutations dropped on
replica A.
2) If back-pressure is enabled, the back-pressure rate limiting at the coordinator is set
at 50/s (assuming the SLOW configuration) to smooth the load between servers, which means
~900 mutations will end up in client timeouts, and it will be the client's responsibility to
back off to a saner ingestion rate; if it keeps ingesting at the higher rate, there's
nothing we can do to smooth its side of the equation.
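The arithmetic behind the two cases above can be sketched as follows (the names and the model are illustrative, not Cassandra code; the numbers are the ones from the example, and the excess works out to roughly 900-950 mutations/s):

```java
// Back-of-the-envelope model of the CL.ONE scenario described above.
public class BackPressureScenario
{
    public static void main(String[] args)
    {
        int incomingRate = 1000;  // requests/s arriving at the coordinator
        int slowReplicaRate = 50; // replica A processing rate
        int fastReplicaRate = 1000; // replica B processing rate

        // Case 1: back-pressure disabled. The fast replica acks within the
        // timeout, so clients see no timeouts, but the slow replica silently
        // drops the excess mutations.
        int droppedOnSlowReplica = incomingRate - slowReplicaRate;

        // Case 2: back-pressure enabled with the SLOW strategy. The
        // coordinator is rate-limited to the slowest replica, so the same
        // excess surfaces as client timeouts instead of silent drops.
        int clientTimeouts = incomingRate - slowReplicaRate;

        System.out.println("dropped without back-pressure: ~" + droppedOnSlowReplica + "/s");
        System.out.println("client timeouts with back-pressure: ~" + clientTimeouts + "/s");
    }
}
```

The point of the model is that back-pressure doesn't remove the excess load; it only changes who observes it (the replica as drops, or the client as timeouts).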

In case #2, the client timeout can be seen as a signal for the client to slow down, so I'm
fine with that (we also hinted in the past at adding more back-pressure-related information
to the exception, but it seems this requires a change to the native protocol?).

That said, this is a bit of an edge case: most of the time, when there is such a difference
in replica responsiveness, it's because of transient, short-lived events such as GC or
compaction spikes, and the back-pressure algorithm will not catch those, as it's meant to react
to continuous overloading.

On the other hand, when the node is continuously overloaded by clients, most of the time all
replicas will suffer, and the back-pressure will smooth out the load; in that case,
the rate of client timeouts shouldn't really change much, but I'll run another test with the
new changes.
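To illustrate why a rate-based signal reacts to sustained overload but not to a short spike, here is a toy sketch (not Cassandra's actual implementation): it compares acks to sends over a whole window, so a brief stall barely moves the ratio.

```java
// Illustrative windowed send/ack ratio. A ratio well below 1.0 over the
// window indicates sustained overload; a short "GC pause" only dents it.
public class RateWindow
{
    private long incoming = 0;
    private long outgoing = 0;

    public void onSend() { incoming++; }
    public void onAck()  { outgoing++; }

    // Ratio < 1.0 means the replica is falling behind over the window.
    public double ratio()
    {
        return incoming == 0 ? 1.0 : (double) outgoing / incoming;
    }

    public static void main(String[] args)
    {
        RateWindow w = new RateWindow();
        // 1000 sends in the window; the replica stalls for 50 requests
        // (a short pause) but acks everything else.
        for (int i = 0; i < 1000; i++)
        {
            w.onSend();
            if (i < 950)
                w.onAck();
        }
        System.out.println(w.ratio()); // 0.95: nowhere near sustained overload
    }
}
```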

Hope this helps clarify things a little bit.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>                 Key: CASSANDRA-9318
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm,
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding
the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests
and if it reaches a high watermark disable read on client connections until it goes back below
some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other
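The high/low watermark scheme sketched in the issue description could look roughly like this (all names here are hypothetical, not actual Cassandra classes):

```java
// Minimal sketch of bounding in-flight request bytes with watermarks:
// stop reading from client connections above the high watermark, resume
// once outstanding bytes drop back below the low watermark.
public class InFlightLimiter
{
    private final long highWatermark;
    private final long lowWatermark;
    private long outstandingBytes = 0;
    private boolean readEnabled = true;

    public InFlightLimiter(long highWatermark, long lowWatermark)
    {
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    // Called when a request is admitted from a client connection.
    public synchronized void onRequest(long bytes)
    {
        outstandingBytes += bytes;
        if (outstandingBytes >= highWatermark)
            readEnabled = false; // stop reading from client sockets
    }

    // Called when a request completes (answered or dropped).
    public synchronized void onComplete(long bytes)
    {
        outstandingBytes -= bytes;
        if (outstandingBytes <= lowWatermark)
            readEnabled = true; // resume reading
    }

    public synchronized boolean isReadEnabled()
    {
        return readEnabled;
    }
}
```

The gap between the two watermarks is what prevents the read-enable flag from flapping on every request.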

This message was sent by Atlassian JIRA
