cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Resolved] (CASSANDRA-12043) Syncing most recent commit in CAS across replicas can cause all CAS queries in the CQL partition to fail
Date Tue, 28 Jun 2016 13:26:57 GMT


Sylvain Lebresne resolved CASSANDRA-12043.
       Resolution: Fixed
         Reviewer: Jason Brown
    Fix Version/s: 3.9

I still needed to merge the branch upwards and wait on CI results for all branches. This is now done and tests seem "fine" (no failure appears related), so committed. Thanks.

| [2.1|] | [utests|] | [dtests|] |
| [2.2|] | [utests|] | [dtests|] |
| [3.0|] | [utests|] | [dtests|] |
| [3.9|] | [utests|] | [dtests|] |

> Syncing most recent commit in CAS across replicas can cause all CAS queries in the CQL partition to fail
> --------------------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-12043
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Sylvain Lebresne
>             Fix For: 2.1.15, 2.2.7, 3.0.9, 3.9
> We update the most recent commit on requiredParticipant replicas if they are out of sync during the prepare round in the beginAndRepairPaxos method. We keep doing this in a loop until the requiredParticipant replicas have the same most recent commit or we hit the timeout.
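The repair loop above can be sketched as follows. This is a deliberately simplified model, not the actual beginAndRepairPaxos code; the class and method names here are illustrative only:

```java
import java.util.Arrays;

public class CommitSync {
    // Replica state: the timestamp of each replica's most_recent_commit (-1 = none).
    static boolean inSync(long[] commits) {
        return Arrays.stream(commits).distinct().count() == 1;
    }

    // Keep pushing the newest commit to lagging replicas until all agree or we
    // run out of attempts (standing in for the query timeout).
    static boolean beginAndRepair(long[] commits, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (inSync(commits))
                return true;
            long newest = Arrays.stream(commits).max().getAsLong();
            for (int i = 0; i < commits.length; i++)
                commits[i] = newest; // replay the most recent commit to laggards
            // In the bug described below, the replay is shadowed by a
            // same-timestamp tombstone, so the laggard never actually
            // advances and this loop spins until the request times out.
        }
        return false;
    }

    public static void main(String[] args) {
        long[] commits = {42, 42, -1}; // the third replica missed the commit
        System.out.println(beginAndRepair(commits, 10)); // true
    }
}
```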
> Say we have 3 machines A, B and C, and gc_grace on the table is 10 days. We do a CAS write at time 0 and it goes to A and B but not to C. C will get the hint later but will not update the most recent commit in the paxos table; this is how CAS hints work.
> In the paxos table, whose gc_grace=0, most_recent_commit on A and B will be inserted with timestamp 0 and a TTL of 10 days. After 10 days this insert becomes a tombstone at time 0 until it is compacted away, since gc_grace=0.
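The expiry arithmetic in the scenario above, reduced to a tiny self-contained sketch (days as plain ints; nothing here is Cassandra code):

```java
// Illustrates the expiry skew described above: table gc_grace = 10 days,
// so the paxos TTL is 10 days, while the paxos table itself has gc_grace=0.
public class PaxosTtlSkew {
    static final int TTL_DAYS = 10; // paxos TTL = the table's gc_grace

    // A cell written at `writeDay` with the paxos TTL expires at:
    static int expiryDay(int writeDay) {
        return writeDay + TTL_DAYS;
    }

    public static void main(String[] args) {
        int abExpiry = expiryDay(0); // A and B: commit written at day 0
        int cExpiry  = expiryDay(1); // C: same commit replayed at day 1 with a fresh TTL
        System.out.println("A/B expire on day " + abExpiry + ", C on day " + cExpiry);
        // Between day 10 and day 11, A and B hold a tombstone (timestamp 0)
        // while C still holds live data with the same timestamp 0.
    }
}
```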
> Do a CAS read after, say, 1 day on the same CQL partition, and this time the prepare phase involves A and C. most_recent_commit on C for this CQL partition is empty, so A sends its most_recent_commit to C with a timestamp of 0 and a TTL of 10 days. This most_recent_commit on C will expire on the 11th day, since it was inserted after 1 day.
> most_recent_commit is now in sync on A, B and C; however, A's and B's copies will expire on the 10th day whereas C's will expire on the 11th day, since it was inserted one day later.
> Do another CAS read after 10 days, when most_recent_commit on A and B has expired and is treated as a tombstone until compacted. In this CAS read, say A and C are involved in the prepare phase. most_recent_commit will not match between them, since it has expired on A but is still live on C. This causes most_recent_commit to be re-applied to A with a timestamp of 0 and a TTL of 10 days. If A has not yet compacted away the original most_recent_commit, which has expired, this new write to most_recent_commit won't be visible on reads, since there is a tombstone with the same timestamp (a delete wins over data with the same timestamp).
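The tie-breaking rule that keeps the replayed commit invisible can be shown with a toy model. These are not Cassandra's real Cell or conflict-resolution classes, just an illustration of the rule "higher timestamp wins; on a tie, the tombstone wins":

```java
public class Reconcile {
    static final class Cell {
        final long timestamp;
        final boolean tombstone;
        Cell(long timestamp, boolean tombstone) {
            this.timestamp = timestamp;
            this.tombstone = tombstone;
        }
    }

    static Cell reconcile(Cell a, Cell b) {
        if (a.timestamp != b.timestamp)
            return a.timestamp > b.timestamp ? a : b;
        // Equal timestamps: the deletion wins, so the most_recent_commit
        // replayed from C (same timestamp 0) stays invisible on A's reads.
        return a.tombstone ? a : b;
    }

    public static void main(String[] args) {
        Cell expired  = new Cell(0, true);  // A's expired commit, now a tombstone
        Cell replayed = new Cell(0, false); // the commit C keeps re-sending
        System.out.println(reconcile(expired, replayed).tombstone); // true
    }
}
```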
> Another round of prepare will follow, and again A will say it does not know about the most_recent_commit (the replayed write, which is not a tombstone, is shadowed by the original tombstone), and C will again try to send the write to A. This can keep going until the request times out, or until only A and B are involved in the prepare phase.

> When A's original most_recent_commit, which is now a tombstone, is compacted away, all the inserts it was covering come live. These will in turn be replayed to another replica. This ping-pong can keep going for a long time.
> The root issue is that most_recent_commit expires at different times across replicas. When it is replayed to a replica to bring it in sync, we again set the TTL from that point in time.
> During the CAS read which timed out, most_recent_commit was being sent to another replica in a loop. Even successful requests may loop a few times when A and C are involved, succeeding only once the replicas which respond are A and B. So this impacts latencies as well.
> These timeouts get worse when a machine is down, as no progress can be made: the machine with the unexpired commit is always involved in the CAS prepare round. Also, with range movements, the new machine gaining the range has an empty most recent commit and receives the commit at a later time, causing the same issue.
> Repro steps:
> 1. The paxos TTL is max(3 hours, gc_grace) as defined in SystemKeyspace.paxosTtl(). Change this method to not impose a minimum TTL of 3 hours: make SystemKeyspace.paxosTtl() return metadata.getGcGraceSeconds() instead of Math.max(3 * 3600, metadata.getGcGraceSeconds()). We do this so that we don't need to wait for 3 hours.
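The two variants of the TTL computation referenced above, reduced to plain ints so the snippet is self-contained (the real SystemKeyspace.paxosTtl() takes the table metadata rather than an int):

```java
public class PaxosTtl {
    // Shipped behaviour: never less than 3 hours.
    static int paxosTtl(int gcGraceSeconds) {
        return Math.max(3 * 3600, gcGraceSeconds);
    }

    // Repro variant: honour gc_grace_seconds directly, so a table with
    // gc_grace_seconds=120 produces a 120s paxos TTL and the repro does
    // not have to wait 3 hours.
    static int paxosTtlForRepro(int gcGraceSeconds) {
        return gcGraceSeconds;
    }

    public static void main(String[] args) {
        System.out.println(paxosTtl(120));         // 10800 (3 hours)
        System.out.println(paxosTtlForRepro(120)); // 120
    }
}
```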
> Create a 3-node cluster with the code change suggested above, with machines A, B and C
> CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
> use test;
> CREATE TABLE users (user_name int PRIMARY KEY, password int);
> alter table users WITH gc_grace_seconds=120;
> consistency QUORUM;
> bring down machine C
> INSERT INTO users (user_name, password ) VALUES ( 1,1) IF NOT EXISTS;
> Nodetool flush on machines A and B
> (machine C is still down at this point)
> consistency SERIAL;
> tracing on;
> wait 80 seconds
> Bring up machine C
> select * from users where user_name = 1;
> Wait 40 seconds 
> select * from users where user_name = 1;  // All queries from this point forward will fail with a timeout
> One potential fix could be to set the TTL based on the remaining time left on the other replicas: the TTL minus the age of the write (now minus the write's timestamp). This timestamp is calculated from the ballot, which uses server time.
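The proposed fix amounts to giving a replayed commit only the time remaining on the original write, so all replicas expire it at the same moment. A minimal sketch, assuming seconds-granularity timestamps; the method name and signature are illustrative, not Cassandra's:

```java
public class RemainingTtl {
    // Remaining TTL = original TTL minus the write's age, floored at zero.
    // The write time is derived from the ballot (server time), as noted above.
    static int remainingTtlSeconds(int paxosTtl, long writeTimeSec, long nowSec) {
        long elapsed = nowSec - writeTimeSec;
        return (int) Math.max(0, paxosTtl - elapsed);
    }

    public static void main(String[] args) {
        // Commit written at t=0 with a 10-day TTL, replayed to C at t=1 day:
        int tenDays = 10 * 24 * 3600;
        int oneDay  = 24 * 3600;
        System.out.println(remainingTtlSeconds(tenDays, 0, oneDay)); // 777600 (9 days)
        // C's copy now expires on day 10, exactly when A's and B's do,
        // removing the skew that drives the repair ping-pong.
    }
}
```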

This message was sent by Atlassian JIRA
