cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5062) Support CAS
Date Fri, 01 Mar 2013 16:01:14 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590654#comment-13590654
] 

Sylvain Lebresne commented on CASSANDRA-5062:
---------------------------------------------

bq. I think you're correct that we need to wait for a majority to agree on mostRecentCommitted
before proceeding.

I certainly don't fully understand your idea, but Paxos doesn't guarantee that only one proposer
will commit. So while you wait on a majority to agree on mostRecentCommitted (to erase the
state and start the next round of paxos), any concurrent CAS may end up committing (the guarantee
of paxos being that the value committed will be the same one than any other commit for that
round), which will change the "mostRecentCommitted".

bq. Yes, that's the tricky part, and none of the papers go into detail here.

I agree on both count.

Let me try to describe a bit my own current reasoning on this issue (which mainly correspond
to my understanding of what "Paxos made live" describe (even though, they do don't go into
very much detail)). Maybe that way you can tell me how your idea differ.

So, what we're going to do is successive rounds of paxos. So we make the round explicit. Concretely,
we have a round number that is shipped with every message, and each round is a completely
separated instance of the paxos algorithm. And so each round will yield a value. The idea
is that this way, we build a consistent and replicated log of what operation to apply in which
order (operation that we will apply to the data in said order, that the listening part, but
paxos is here to ensure consistency of the log of operation). I.e. round N is about "let's
agree on what's the Nth operation we'll apply on the data".

Concretely, the first thing a proposer does is to fetch his last known committed round N (meaning,
during commit, on top of actually applying the value/operation agreed on, we also basically
write that round N is committed).  Then it start a normal paxos for round N+1 (with whatever
value/operation he want to get in).

You do the full round of paxos and one operation wins. As said above, the "listening" part
just consist in recording that this is the value for round N and to apply it to the actual
CF.

Note that you don't have to wait for anything during commit in that case. It's ok for a proposer
to start a proposal on round N even though that round has been committed already (but the
proposer is not aware yet). It's ok because Paxos guarantees said proposer will learn the
value committed for this round N (and will commit it itself in fact). After which it will
know N is committed and will be free to start a new proposal for round N+1 with the actual
value/operation he wanted to get in in the first place.

That's actually what happens to a node that fails. When it's back up, it might not be up to
date on the most recent round. But that's ok, it will try some old round, will learn the value
committed for said round (not his own since it's behind on what's committed), commit it (which
has mainly the effect of learning the value locally, it will be old news for other replica)
and continue like that until he catches up.

That's the basic idea. Now there is 2 things that are clearly not very practical:
# this suppose we keep the replicated log of operation (i.e. list of which operation has been
agreed on for each round) forever. And that's ever growing.
# the part above about back to life replica catching up by redoing all missed round of paxos
one by one is obviously a bit lame.

I think both problem are linked, and really are the issue discussed in "Paxos made live" in
session 5.5-Snapshots. Now, the good news is that they have been able to optimize both of
those in practice without breaking correctness, so it's possible :). That being said, their
snapshot idea sound fairly heavy-weight. But I think it should be possible to do simpler as
we control the data (which they don't and explicitely cite as a difficulty). I have a few
idea, but honestly they are not fully though trough yet (though let's say that 1) is the hard
part imo. For 2) you can have other replica "repair" a proposer that starts a proposal on
an old round so it catches up right away).

Now, if I ignore 1) and 2) for a minute, I think I understand the principle enough to be convinced
that it is correct. Also, this is really just using paxos to build a consistent and replicated
log of the operation applied, but that log is separated from the data itself. So while this
does mean we would have to basically have paxos columns for which all update goes through
paxos, this also means that there is nothing to do on the read path itself.

Obviously, 1) and 2) need to be optimized for the idea to be realistic, and this without breaking
correctness. And while I have ideas, I admit I haven't though all the details through. Still,
I wanted to dump my current reasoning on this out.

I note that it could be that Jonathan's proposal is in fact just the same thing, but where
1) and 2) are optimized so well that there is no real history to keep at all and so no real
need to keep track of rounds per se. Which would be great, but if that's the case I think
we do need to really understand how theses optimizations work and convince ourselves that
it does preserve correctness. I'd personaly need more detail for that :)

                
> Support CAS
> -----------
>
>                 Key: CASSANDRA-5062
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 2.0
>
>
> "Strong" consistency is not enough to prevent race conditions.  The classic example is
user account creation: we want to ensure usernames are unique, so we only want to signal account
creation success if nobody else has created the account yet.  But naive read-then-write allows
clients to race and both think they have a green light to create.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message