cassandra-user mailing list archives

From Carl Yeksigian <>
Subject Re: Handling uncommitted paxos state
Date Thu, 25 Feb 2016 18:35:15 GMT
The paxos state is written to a system table (system.paxos) on each of the
paxos coordinators, so it goes through the normal write path, including
persisting to the commit log and being stored in a memtable until being flushed to
disk. As such, the state can survive restarts. These states are not treated
differently from our normal memtables, so there isn't any special handling
for a GC.
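
As a rough illustration of why the state survives a restart, here is a toy
model in Python (not Cassandra code). The Node class, the JSON log format and
the file name are all made up for the sketch; the only point is that a write
is appended to a durable log before it lands in the in-memory table, so a
restarted node can replay it.

```python
import json
import os
import tempfile


class Node:
    """Toy node: an append-only log on disk plus an in-memory table.

    Hypothetical model for illustration only, not how Cassandra is built.
    """

    def __init__(self, log_path):
        self.log_path = log_path
        self.memtable = {}
        # On startup, replay whatever was logged before the restart.
        if os.path.exists(log_path):
            with open(log_path) as log:
                for line in log:
                    table, key, value = json.loads(line)
                    self.memtable[(table, key)] = value

    def write(self, table, key, value):
        # Append to the log and sync it first, then update memory.
        with open(self.log_path, "a") as log:
            log.write(json.dumps([table, key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.memtable[(table, key)] = value


def demo_restart():
    """Write a paxos row, 'restart' the node, and return what it recovered."""
    path = os.path.join(tempfile.mkdtemp(), "commitlog")
    node = Node(path)
    node.write("system.paxos", "user:1", {"ballot": 1, "update": {"a": 1}})
    restarted = Node(path)  # simulates the process coming back up
    return restarted.memtable[("system.paxos", "user:1")]
```

In this model a row in system.paxos is treated exactly like any other row,
which mirrors the point above: nothing about the paxos state needs special
handling.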

There is no process which will come in and fix up the values; they are
fixed at a partition level when trying to perform a CAS operation, or when
reading at a SERIAL consistency. This operation happens at the partition level,
so if any part of the partition is read or updated, it will first finish any
previous uncommitted Paxos round for that partition.
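
To make the partition-level granularity concrete, here is a small toy model
in Python. It is only a sketch under simplifying assumptions: Replica,
finish_in_progress and serial_read are hypothetical names, each replica keeps
at most one accepted-but-uncommitted proposal per partition key, and the
in-progress proposal is simply committed everywhere rather than being
re-proposed under a new ballot as a real Paxos implementation would do.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Toy replica: committed rows plus one paxos slot per partition key."""
    data: dict = field(default_factory=dict)   # partition key -> {column: value}
    paxos: dict = field(default_factory=dict)  # partition key -> (ballot, update)

    def accept(self, pk, ballot, update):
        # Record an accepted proposal that has not been committed yet.
        self.paxos[pk] = (ballot, update)

    def commit(self, pk, ballot):
        # Apply the accepted proposal to the data and clear the slot.
        entry = self.paxos.get(pk)
        if entry is not None and entry[0] == ballot:
            self.data.setdefault(pk, {}).update(entry[1])
            del self.paxos[pk]


def finish_in_progress(replicas, pk):
    """Commit the highest-ballot pending proposal for pk on every replica."""
    pending = [r.paxos[pk] for r in replicas if pk in r.paxos]
    if not pending:
        return
    ballot, update = max(pending, key=lambda p: p[0])
    for r in replicas:
        r.accept(pk, ballot, update)
        r.commit(pk, ballot)


def serial_read(replicas, pk, column):
    """A SERIAL read first finishes pending work for the whole partition."""
    finish_in_progress(replicas, pk)
    return replicas[0].data.get(pk, {}).get(column)


def demo():
    replicas = [Replica(), Replica(), Replica()]
    # A CAS write to columns a and b was accepted by two replicas,
    # but the coordinator timed out before the commit phase.
    for r in replicas[:2]:
        r.accept("user:1", 1, {"a": 1, "b": 2})
    # Reading an unrelated column of the same partition completes the write.
    serial_read(replicas, "user:1", "c")
    return replicas
```

In this model the pending write is keyed only by partition, so a SERIAL read
of any column in that partition completes it for all the columns it touched.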

If you want to know more,
has a lot more information about lightweight transactions.


On Thu, Feb 25, 2016 at 4:23 AM, Nicholas Wilson <> wrote:

> Hi,
> I have some questions about the behaviour of 'uncommitted paxos state', as
> described here:
> If a WriteTimeoutException with WriteType.SIMPLE is thrown for a CAS
> write, that means that the paxos phase was successful, but the data
> couldn't be committed during the final 'commit/reset' phase. On the next
> SERIAL write or read, any other node can commit the write on behalf of the
> original proposer, and must do so in fact before forming a new ballot. This
> stops the columns from getting 'stuck' if the coordinator experiences a
> network partition after forming the ballot, but before committing.
> My questions are on the durability of the uncommitted state:
> Suppose CAS writes are infrequent, and it takes weeks before another write
> is done to that column; will the paxos state still be there, waiting
> forever until the next commit, or does it get automatically committed
> during GC if you wait long enough? (I don't see how it could be cleaned up
> by a GC though, since the nodes holding the paxos state don't know if the
> ballot was won or not.)
> Or, what if all the nodes are switched off (briefly); is the uncommitted
> paxos state persisted to disk in the log/journal, so the write can still be
> completed when the cluster comes back online?
> Finally, how granular is the paxos state? Will the uncommitted write be
> completed on the next SERIAL write that touches the same exact combination
> of cells, or is it per-column across the partition, or....? If the CAS
> write touches two or three cells in the row, will a subsequent SERIAL read
> from any one of those three columns complete the uncommitted state,
> presumably on the other columns as well?
> Thanks for your help,
> Nick
> ---
> Nick Wilson
> Software engineer, RealVNC
