cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10041) "timeout during write query at consistency ONE" when updating counter at consistency QUORUM and 2 of 3 nodes alive
Date Thu, 07 Apr 2016 08:58:25 GMT


Sylvain Lebresne commented on CASSANDRA-10041:

bq. In this case, it is not possible to identify in which phase the counter mutation failed.

That's right, you can't identify in which phase the counter mutation failed. But given how
counters currently work, we can't send you that information: the timeout is sent by the coordinator,
which only gets acks once everything is finished, so if it doesn't get acks, it doesn't know
which phase we're in. We'd need to change the protocol used internally, as suggested a long
time ago in CASSANDRA-3199, but we've so far decided that the ROI for that wasn't good enough
(mostly due to the huge headache that making this change while maintaining backward compatibility
and rolling upgrades would be). Note in particular that even doing that wouldn't _avoid_ the timeout; it
would just make a tiny bit more info available to the coordinator when it happens, and that
info might not even be enough to know for sure whether the counter update has been persisted.
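
To illustrate why the coordinator cannot report the phase, here is a deliberately simplified toy model (not Cassandra's actual code). The phase names and function names are hypothetical, chosen only for this sketch; the single assumption taken from the explanation above is that an ack is sent only once every phase has finished.

```python
# Hypothetical phase names for a counter write; not Cassandra's internal names.
PHASES = ("leader_local_apply", "replicate_to_peers")


def leader_write(fail_in=None):
    """Toy counter write on the replica that leads the update.

    Returns (acked, persisted_locally). An ack is produced only if every
    phase completed; `fail_in` names the phase that stalls, if any.
    """
    persisted = False
    for phase in PHASES:
        if phase == fail_in:
            return False, persisted  # the write stalls here, so no ack is sent
        if phase == "leader_local_apply":
            persisted = True  # the increment is now stored on the leader
    return True, persisted


def coordinator_view(fail_in=None):
    """All the coordinator can observe: an ack, or (eventually) a timeout."""
    acked, _persisted = leader_write(fail_in)
    return "ack" if acked else "timeout"


for phase in PHASES:
    acked, persisted = leader_write(fail_in=phase)
    print(phase, coordinator_view(fail_in=phase), "persisted:", persisted)
```

In this model both failure cases look identical to the coordinator ("timeout"), yet the increment is persisted in one case and not in the other, which is why the error cannot say which phase failed.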

Overall, closing this issue as not a problem. Yes, whenever a node dies, some counter inserts
can time out during the window it takes for the failure detector to mark that node dead, and
this happens even if you have, in theory, enough nodes alive to fulfill the CL requirements. And yes,
that's sad. But it's unfortunately an intrinsic limitation of the counter design for which
we don't have a solution.
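
The window described above can be pictured with a small toy model. This is only a sketch under stated assumptions: the function name, the `routed_to_dead_node` flag, and the timing values are hypothetical, and the model ignores everything except the gap between a node dying and the failure detector marking it dead.

```python
def counter_write_outcome(now, died_at, detection_delay, routed_to_dead_node):
    """Toy model: outcome of one counter write issued at time `now` (seconds).

    `detection_delay` is the assumed time the failure detector needs to mark
    the node dead; `routed_to_dead_node` says whether this write depends on
    the node that was killed.
    """
    node_is_down = now >= died_at
    marked_dead = now >= died_at + detection_delay
    if node_is_down and not marked_dead and routed_to_dead_node:
        # The node is gone but still believed alive, so the write waits
        # for an ack that never comes.
        return "timeout"
    # Either the node is still up, or it is already marked dead and the
    # write is assumed to be handled by a live node instead.
    return "ok"


# Node dies at t=10s; the detector is assumed to need 8s to notice.
for t in (5, 12, 18):
    print(t, counter_write_outcome(t, died_at=10, detection_delay=8,
                                   routed_to_dead_node=True))
```

With these made-up numbers, only the write at t=12 falls inside the detection window and times out, even though two of the three nodes are alive the whole time.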

Or to put it another way, this is working as designed, which doesn't mean we disagree that
this is a weakness of said design.

> "timeout during write query at consistency ONE" when updating counter at consistency QUORUM and 2 of 3 nodes alive
> ------------------------------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-10041
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: centos 6.6 server, java version "1.8.0_45", cassandra 2.1.8, 3 machines, keyspace with replication factor 3
>            Reporter: Anton Lebedevich
>             Fix For: 2.1.x
> Test scenario is: kill -9 one node, wait 60 seconds, start it back up, wait till it becomes
> available, wait 120 seconds (during that time all 3 nodes are up), repeat with the next node.
> The application reads from one table and updates counters in another table with consistency QUORUM.
> When one node out of 3 is killed, the application logs this exception for several seconds:
> {noformat}
> Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout
during write query at consistency ONE (1 replica were required but only 0 acknowledged the
>         at com.datastax.driver.core.Responses$Error$1.decode( ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>         at com.datastax.driver.core.Responses$Error$1.decode( ~[com.datastax.cassandra.cassandra-driver-core-2.1.6.jar:na]
>         at com.datastax.driver.core.Message$ProtocolDecoder.decode(
>         at com.datastax.driver.core.Message$ProtocolDecoder.decode(
>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(
>         ... 13 common frames omitted
> {noformat}

