cassandra-commits mailing list archives

From "Tyler Hobbs (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-12256) Properly respect the request timeouts
Date Tue, 16 Aug 2016 16:46:20 GMT


Tyler Hobbs commented on CASSANDRA-12256:

It looks like the last dtest run actually errored out, and the one before that didn't have
your latest changes.  So, I've kicked off another run, and hopefully we can get clearer results
from that.

If we do still have failures due to timeouts, let's look at them individually to make sure
that they're not indicative of a problem with the patch.  If they aren't, raising the "custom"
timeout for the test probably makes the most sense.

> Properly respect the request timeouts
> -------------------------------------
>                 Key: CASSANDRA-12256
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Geoffrey Yu
>             Fix For: 3.x
>         Attachments: 12256-trunk-v1v2.diff, 12256-trunk-v2.txt, 12256-trunk.txt
> We have a number of {{request_timeout_*}} options that probably every user expects to
be an upper bound on how long the coordinator will wait before timing out a request, but that's
actually not always the case, especially for read requests.
> I believe we don't respect those timeouts properly in at least the following cases:
> * On a digest mismatch: in that case, we reset the timeout for the data query, which
means the overall query might take up to twice the configured timeout before timing out.
> * On a range query: the timeout is reset for every sub-range that is queried. With many
nodes and vnodes, a range query could span a great many sub-ranges, and so a range query could take
pretty much arbitrarily long before actually timing out for the user.
> * On short reads: we also reset the timeout for every short-read "retry".
> It's also worth noting that even outside those, the timeouts don't take most of the processing
done by the coordinator (query parsing and CQL handling for instance) into account.
> Now, in all fairness, the reason this is this way is that the timeouts currently are *not*
timeouts for the full user request, but rather how long a coordinator should wait on any given
replica for any given internal query before giving up. *However*, I'm pretty sure this is
not what users intuitively expect and want, *especially* in the context of CASSANDRA-2848 where
the goal is explicitly to have an upper bound on the query from the user's point of view.
> So I'm suggesting we change how those timeouts are handled to really be timeouts on the
whole user query.
> And by that I basically just mean that we'd mark the start of each query as soon as possible
in the processing, and use that starting time as the base in {{ReadCallback.await}} and {{AbstractWriteResponseHandler.get()}}.
It won't be perfect in the sense that we'll still only possibly time out during "blocking"
operations, so typically if parsing a query takes more than your timeout, you still won't
time out until that query is sent, but I think that's probably fine in practice because 1)
if your timeouts are small enough that this matters, you're probably doing it wrong and 2) we
can totally improve on that later if need be.
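
The proposal above amounts to waiting against a single deadline fixed at query start, instead of restarting the clock for each internal step. A minimal sketch of that idea follows. It is not taken from the attached patches: the class `QueryDeadline` and its methods are hypothetical names used only for illustration, and the latch stands in for whatever the real callbacks block on.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not Cassandra's actual code): one deadline per user query.
final class QueryDeadline {
    private final long startNanos;   // taken as early as possible in query processing
    private final long timeoutNanos; // the configured request timeout

    QueryDeadline(long startNanos, long timeoutNanos) {
        this.startNanos = startNanos;
        this.timeoutNanos = timeoutNanos;
    }

    // Time left before the deadline, clamped at zero once it has passed.
    long remainingNanos(long nowNanos) {
        long elapsed = nowNanos - startNanos;
        return Math.max(0L, timeoutNanos - elapsed);
    }

    // Blocks for at most the remaining budget, so repeated waits (digest
    // mismatch, sub-ranges, short-read retries) all share the same deadline.
    // Returns false if the budget ran out or the thread was interrupted.
    boolean await(CountDownLatch responses) {
        try {
            return responses.await(remainingNanos(System.nanoTime()), TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

With this shape, each blocking call gets only what is left of the original budget, so the total wait stays bounded by the configured timeout regardless of how many internal queries are issued. Time spent outside blocking calls (parsing, for instance) still only surfaces at the next wait, as the description notes.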

This message was sent by Atlassian JIRA
