cassandra-commits mailing list archives

From "Sergio Bossa (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
Date Mon, 11 Jul 2016 10:38:11 GMT


Sergio Bossa commented on CASSANDRA-9318:


bq. Do we track the case where we receive a failed response? Specifically, in ResponseVerbHandler.doVerb,
shouldn't we call updateBackPressureState() also when the message is a failure response?

Good point. I focused more on the success case, due to dropped mutations, but that sounds
like a good thing to do.

bq. If we receive a response after it has timed out, won't we count that request twice, incorrectly
increasing the rate for that window?

But can that really happen? {{ResponseVerbHandler}} returns _before_ incrementing back-pressure
if the callback is null (i.e. expired), and {{OutboundTcpConnection}} doesn't even send outbound
messages if they're timed out, or am I missing something?
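
To make the argument concrete, here is a minimal sketch of the guard I am describing. It is not the actual {{ResponseVerbHandler}} code: the class, the callback map and the counter are all hypothetical names for illustration. It only shows that a response whose callback has already expired returns before the back-pressure counter is touched.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not Cassandra's ResponseVerbHandler.
final class ResponseHandlerSketch {
    // Pending callbacks keyed by message id; expiry removes the entry.
    private final Map<Integer, Runnable> callbacks = new ConcurrentHashMap<>();
    // Stand-in for the per-window incoming-response count used by back-pressure.
    private final AtomicInteger countedResponses = new AtomicInteger();

    void register(int messageId, Runnable callback) {
        callbacks.put(messageId, callback);
    }

    // Simulates the callback timing out before any response arrives.
    void expire(int messageId) {
        callbacks.remove(messageId);
    }

    void onResponse(int messageId) {
        Runnable callback = callbacks.remove(messageId);
        if (callback == null) {
            // Expired (or already handled): return before updating back-pressure.
            return;
        }
        countedResponses.incrementAndGet();
        callback.run();
    }

    int countedResponses() {
        return countedResponses.get();
    }
}
```

Under these assumptions, a late response for an expired request is never counted, so it cannot inflate the rate for that window.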

bq. I also argue that it is quite easy to comment out the strategy and to have an empty strategy
in the code that means no backpressure.

Again, I believe this would make enabling/disabling back-pressure via JMX less user-friendly.

bq. I think what we may need is a new companion snitch that sorts the replica by backpressure

I do not think sorting replicas is what we really need, as you have to send the mutation to
all replicas anyway. I think what you rather need is a way to pre-emptively fail if the write
consistency level is not met by enough "non-overloaded" replicas, i.e.:
* If CL.ONE, fail if *all* replicas are overloaded.
* If CL.QUORUM, fail if *quorum* replicas are overloaded.
* If CL.ALL, fail if *any* replica is overloaded.

This can be easily accomplished in {{StorageProxy#sendToHintedEndpoints}}.
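
As a rough sketch of that check, assuming we already know which replicas are overloaded: the class, enum and method names below are hypothetical and do not exist in {{StorageProxy}}. The sketch follows the "not met by enough non-overloaded replicas" formulation, which matches the three bullets above for 3 replicas and can differ slightly for even replica counts.

```java
// Hypothetical sketch; not Cassandra's actual StorageProxy code.
final class OverloadCheck {
    // Simplified stand-in for the write consistency levels discussed above.
    enum Level { ONE, QUORUM, ALL }

    // Number of replica acknowledgements the consistency level requires.
    static int requiredAcks(Level level, int replicaCount) {
        switch (level) {
            case ONE:
                return 1;
            case QUORUM:
                return replicaCount / 2 + 1;
            case ALL:
                return replicaCount;
            default:
                throw new IllegalArgumentException("unsupported level: " + level);
        }
    }

    // overloaded[i] is true when replica i is considered overloaded.
    // Returns true when too few non-overloaded replicas remain to satisfy
    // the consistency level, i.e. the write should fail pre-emptively.
    static boolean shouldFailFast(Level level, boolean[] overloaded) {
        int healthy = 0;
        for (boolean isOverloaded : overloaded) {
            if (!isOverloaded) {
                healthy++;
            }
        }
        return healthy < requiredAcks(level, overloaded.length);
    }
}
```

With 3 replicas of which 2 are overloaded, this fails fast for QUORUM and ALL but lets a CL.ONE write proceed.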

bq. the exception needs to be different. native_protocol_v4.spec clearly states

I missed that too :(

This leaves us with two options:
* Adding a new exception to the native protocol.
* Reusing a different exception, with {{WriteFailureException}} and {{UnavailableException}}
the most likely candidates.

I'm currently leaning towards the latter option.

bq. By "load shedding by the replica" do we mean dropping mutations that have timed out or
something else?


bq. Regardless, there is the problem of ensuring that all nodes have backpressure enabled,
which may not be trivial.

We only need to ensure the coordinator for that specific mutation has back-pressure enabled,
and we could do this by "marking" the {{MessageOut}} with a special parameter, what do you

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>                 Key: CASSANDRA-9318
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm,
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding
the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests
and if it reaches a high watermark disable read on client connections until it goes back below
some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other

