cassandra-commits mailing list archives

From "sankalp kohli (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-12043) Syncing most recent commit in CAS across replicas can cause all CAS queries in the CQL partition to fail
Date Tue, 21 Jun 2016 03:04:57 GMT
sankalp kohli created CASSANDRA-12043:
-----------------------------------------

             Summary: Syncing most recent commit in CAS across replicas can cause all CAS
queries in the CQL partition to fail
                 Key: CASSANDRA-12043
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12043
             Project: Cassandra
          Issue Type: Bug
            Reporter: sankalp kohli


We update the most recent commit on requiredParticipant replicas if it is out of sync during the
prepare round in the beginAndRepairPaxos method. We keep doing this in a loop until the requiredParticipant
replicas have the same most recent commit or we hit the timeout.
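That loop can be sketched roughly as follows. This is a simplified, hypothetical model, not the actual beginAndRepairPaxos code: each replica is reduced to just the timestamp of its most recent commit, and network details are reduced to a deadline check.

```java
// Hedged sketch of the repair loop described above (illustrative names, not Cassandra's API).
class MostRecentCommitRepair {
    // Returns the number of prepare rounds it took to bring all participants in sync,
    // or throws when the deadline passes first (the "CAS query times out" case).
    // In the bug described below, the pushed commit is masked by a same-timestamp
    // tombstone on the behind replica, so its visible timestamp never advances and
    // this loop spins until the deadline.
    static int repairUntilInSync(long[] commitTimestamps, long deadlineMillis) {
        int rounds = 0;
        while (true) {
            if (System.currentTimeMillis() > deadlineMillis)
                throw new RuntimeException("timed out syncing most recent commit");
            long newest = Long.MIN_VALUE;
            for (long ts : commitTimestamps)
                newest = Math.max(newest, ts);
            boolean inSync = true;
            for (int i = 0; i < commitTimestamps.length; i++) {
                if (commitTimestamps[i] < newest) { // replica is behind: push the commit to it
                    commitTimestamps[i] = newest;
                    inSync = false;
                }
            }
            rounds++;
            if (inSync)
                return rounds;
        }
    }
}
```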

Say we have 3 machines A, B and C, and gc_grace on the table is 10 days. We do a CAS write at
time 0 and it reaches A and B but not C. C will get the hint later, but will not update
the most recent commit in the paxos table; this is how CAS hints work.
In the paxos table, whose gc_grace=0, most_recent_commit on A and B will be inserted with timestamp
0 and a TTL of 10 days. After 10 days, this insert becomes a tombstone at time 0
until it is compacted away, since gc_grace=0.

Do a CAS read after, say, 1 day on the same CQL partition, and this time the prepare phase involves
A and C. most_recent_commit on C for this CQL partition is empty, so A sends its most_recent_commit
to C with a timestamp of 0 and a TTL of 10 days. This most_recent_commit on C will expire
on the 11th day, since it was inserted after 1 day.

most_recent_commit is now in sync on A, B and C; however, on A and B it will expire
on the 10th day, whereas on C it will expire on the 11th day, since it was inserted one day later.
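The divergence comes from one fact about TTL'd cells: a cell expires at its local insertion time plus the TTL, regardless of its write timestamp, so re-sending the commit restarts the countdown on the receiving replica. A toy model of the arithmetic, in day units to match the text:

```java
// Worked arithmetic for the divergent expiry described above (toy model, day units).
class PaxosCommitExpiry {
    // The countdown starts at the local insertion time, not at the cell's write timestamp.
    static int expiryDay(int insertedOnDay, int ttlDays) {
        return insertedOnDay + ttlDays;
    }
}
```

A and B inserted the commit on day 0, C received the repaired copy on day 1, so between day 10 and day 11 the replicas disagree.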


Do another CAS read after 10 days, when most_recent_commit on A and B has expired and is treated
as a tombstone until compacted. In this CAS read, say A and C are involved in the prepare phase.
most_recent_commit will not match between them, since it has expired on A but is still live
on C. This causes most_recent_commit to be applied to A with a timestamp of 0 and a TTL
of 10 days. If A has not yet compacted away the original, expired most_recent_commit,
this new write to most_recent_commit won't be visible on reads, since there is a tombstone with
the same timestamp (a delete wins over data with the same timestamp).
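The invisibility follows from the standard cell-reconciliation rule: the higher timestamp wins, and on a timestamp tie a tombstone beats live data. A minimal sketch (illustrative names, not Cassandra's actual Cell classes):

```java
// Sketch of the reconciliation rule invoked above. This is why the commit re-sent
// to A with timestamp 0 stays hidden behind A's timestamp-0 tombstone.
class CellReconcile {
    static final class Cell {
        final long timestamp;
        final boolean tombstone;
        Cell(long timestamp, boolean tombstone) { this.timestamp = timestamp; this.tombstone = tombstone; }
    }

    static Cell reconcile(Cell a, Cell b) {
        if (a.timestamp != b.timestamp)
            return a.timestamp > b.timestamp ? a : b; // newer timestamp wins
        if (a.tombstone != b.tombstone)
            return a.tombstone ? a : b;               // tie: delete wins over data
        return a;                                     // full tie: either copy
    }
}
```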

Another round of prepare will follow, and again A will say it does not know about the most recent write
(it is covered by the original write, which is now a tombstone), and C will again try to send the write to A. This
can keep going until the request times out, or until only A and B are involved in the prepare phase.


When A’s original most_recent_commit, which is now a tombstone, is compacted away, all the inserts
it was covering come live. These will in turn be replayed to another replica.
This ping-pong can keep going for a long time.

The underlying issue is that most_recent_commit expires at different times on different replicas. Whenever
it is replayed to a replica to bring it in sync, the TTL is reset from that point.

During the CAS read that timed out, most_recent_commit was being sent to another replica
in a loop. Even successful requests will loop a couple of times when A and C are involved
in the prepare phase, and only succeed once the responding replicas are A and B. So this
impacts latencies as well.

These timeouts get worse when a machine is down, since no progress can be made: the machine
with the unexpired commit is always involved in the CAS prepare round. Range movements cause
the same issue: the new machine gaining the range starts with an empty most recent commit
and receives the commit at a later time.

Repro steps:
1. Paxos TTL is max(3 hours, gc_grace) as defined in SystemKeyspace.paxosTtl(). Change this
method to not impose a minimum TTL of 3 hours, so that we don't need to wait 3 hours:
    return metadata.getGcGraceSeconds();    // instead of: return Math.max(3 * 3600, metadata.getGcGraceSeconds());

2. Create a 3 node cluster, with the code change suggested above, on machines A, B and C:
CREATE KEYSPACE test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
USE test;
CREATE TABLE users (user_name int PRIMARY KEY, password int);
ALTER TABLE users WITH gc_grace_seconds = 120;
CONSISTENCY QUORUM;
3. Bring down machine C.
4. INSERT INTO users (user_name, password) VALUES (1, 1) IF NOT EXISTS;
5. Run nodetool flush on machines A and B.
6. Bring up the down machine B.
7. CONSISTENCY SERIAL;
8. TRACING ON;
9. Wait 80 seconds.
10. Bring up machine C.
11. SELECT * FROM users WHERE user_name = 1;
12. Wait 40 seconds.
13. SELECT * FROM users WHERE user_name = 1;  -- all queries from this point forward will time out


One potential fix could be to set the TTL based on the time remaining on the other
replicas, i.e. the original TTL minus the age of the write (now minus its timestamp). This
timestamp is calculated from the ballot, which uses server time.
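That proposal can be sketched as follows. Method and parameter names are assumptions for illustration, not the actual patch; the point is that the repaired copy never outlives the original expiry.

```java
// Hedged sketch of the proposed fix: when repairing a replica, derive the TTL from the
// time already elapsed since the commit's timestamp (taken from the ballot, which uses
// server time) instead of granting a fresh full TTL, so every replica expires the
// commit at roughly the same wall-clock moment.
class RemainingPaxosTtl {
    static int remainingTtlSeconds(int paxosTtlSeconds, long commitTimestampMicros, long nowMicros) {
        long elapsedSeconds = (nowMicros - commitTimestampMicros) / 1_000_000L;
        return (int) Math.max(paxosTtlSeconds - elapsedSeconds, 0); // never extend past the original expiry
    }
}
```

In the scenario above (10-day paxos TTL, commit written at time 0, repaired after 1 day), C would receive a 9-day TTL and expire on day 10 along with A and B.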



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
