cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stefania (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9318) Bound the number of in-flight requests at the coordinator
Date Tue, 12 Jul 2016 03:10:11 GMT


Stefania commented on CASSANDRA-9318:

bq. But can that really happen? ResponseVerbHandler returns before incrementing back-pressure
if the callback is null (i.e. expired), and OutboundTcpConnection doesn't even send outbound
messages if they're timed out, or am I missing something?

You're correct, we won't count twice because the callback is already null. However, this raises
another point, if a message expires before it is sent, we consider this negatively for that
replica, since we increment the outgoing rate but not the incoming rate when the callback
expires, and still it may have nothing to do with the replica if the message was not sent,
it may be due to the coordinator dealing with too many messages.
bq. Again, I believe this would make enabling/disabling back-pressure via JMX less user friendly.

Fine, let's keep the boolean since it makes life easier for JMX.

bq. I do not think sorting replicas is what we really need, as you have to send the mutation
to all replicas anyway. I think what you rather need is a way to pre-emptively fail if the
write consistency level is not met by enough "non-overloaded" replicas, i.e.:

You're correct in that the replicas are not sorted in the write path, only in the read path.
I confused the two yesterday. For sure we need to only fail if the write consistency level
is not met. I also observe that if a replica has a low rate, then we may block when acquiring
the limiter, and this will indirectly throttle for all following replicas, even if they were
ready to receive mutations sooner. Therefore, even a single overloaded or slow replica may
slow the entire write operation. Further, AbstractWriteResponseHandler sets the start time
in the constructor, so the time spent acquiring a rate limiter for slow replicas counts towards
the total time before the coordinator throws a write timeout exception. So, unless we increase
the write RPC timeout or change the existing behavior, we may observe write timeout exceptions
and, at CL.ANY, hints.

Also, in SP.sendToHintedEndpoints(), we should apply backpressure only if the destination
is alive.
This leaves us with two options:

    Adding a new exception to the native protocol.
    Reusing a different exception, with WriteFailureException and UnavailableException the
most likely candidates.

I'm currently leaning towards the latter option.

Let's use UnavailableException since WriteFailureException indicates a non-timeout failure
when processing a mutation, and so it is not appropriate for this case. For protocol V4 we
cannot change UnavailableException, but for V5 we should add a new parameter to it. At the
moment it contains {{<cl><required><alive>}}, we should add the number of
overloaded replicas, so that drivers can treat the
two cases differently. Another alternative, as suggested by [~slebresne], is to simply consider
overloaded replicas as dead and hint them, therefore throwing unavailable exceptions as usual,
but this is slightly less accurate then letting clients know that some replicas were unavailable
and some simply overloaded.
bq. We only need to ensure the coordinator for that specific mutation has back-pressure enabled,
and we could do this by "marking" the MessageOut with a special parameter, what do you think?

Marking messages as throttled would let the replica know if backpressure was enabled, that's
true, but it also makes the existing mechanism even more complex. Also, as far as I understand
it, dropping mutations that have been in the queue for longer that the RPC write timeout is
done not only to shed load on the replica, but also to avoid wasting resources to perform
a mutation when the coordinator has already returned a timeout exception to the client. I
think this still holds true regardless of backpressure. Since we cannot remove a timeout check
in the write response handlers, I don't see how it helps to drop it replica side. If the message
was throttled, even with cross_node_timeout enabled, the replica should have time to process
it before the RPC write timeout expires, so I don't think the extra complexity is justified.

bq. If you all agree with that, I'll move forward and make that change.

To summarize, I agree with this change, provided the drivers can separate the two cases (node
unavailable vs. node overloaded), which they will be able to do with V5 of the native protocol.
The alternative, would be to simply consider overloaded replicas as dead and hint them. Further,
I still have concerns regarding additional write timeout exceptions and whether an overloaded
or slow replica can slow everything down. [~slebresne], [~jbellis] anything else from your
side? I think Jonathan's proposal of bounding total outstanding requests to all replicas,
is somewhat different than what Sergio is trying to do but it can be discussed further.

> Bound the number of in-flight requests at the coordinator
> ---------------------------------------------------------
>                 Key: CASSANDRA-9318
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths, Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Assignee: Sergio Bossa
>         Attachments: 9318-3.0-nits-trailing-spaces.patch, backpressure.png, limit.btm,
> It's possible to somewhat bound the amount of load accepted into the cluster by bounding
the number of in-flight requests and request bytes.
> An implementation might do something like track the number of outstanding bytes and requests
and if it reaches a high watermark disable read on client connections until it goes back below
some low watermark.
> Need to make sure that disabling read on the client connection won't introduce other

This message was sent by Atlassian JIRA

View raw message