cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-5062) Support CAS
Date Fri, 01 Mar 2013 03:37:15 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590210#comment-13590210
] 

Jonathan Ellis commented on CASSANDRA-5062:
-------------------------------------------

A more-complete sketch of the coordinator side:

{code}
.   /**
     * Apply @param updates if and only if the current values in the row for @param key
     * match the ones given by @param old.  The algorithm is "raw" Paxos: that is, Paxos
     * minus leader election -- any node in the cluster may propose changes for any row,
     * which (that is, the row) is the unit of values being proposes, not single columns.
     *
     * The Paxos cohort is only the replicas for the given key, not the entire cluster.
     * So we expect performance to be reasonable, but CAS is still intended to be used
     * "when you really need it," not for all your updates.
     *
     * There are three phases to Paxos:
     *  1. Prepare: the coordinator generates a ballot (timeUUID in our case) and asks replicas
to (a) promise
     *     not to accept updates from older ballots and (b) tell us about the most recent
update it has already
     *     accepted.
     *  2. Accept: if a majority of replicas reply, the coordinator asks replicas to accept
the value of the
     *     highest proposal ballot it heard about, or a new value if no in-progress proposals
were reported.
     *  3. Commit (Learn): if a majority of replicas acknowledge the accept request, we can
commit the new
     *     value.
     *
     *  Commit procedure is not covered in "Paxos Made Simple," and only briefly mentioned
in "Paxos Made Live,"
     *  so here is our approach:
     *   3a. The coordinator sends a commit message to all replicas with the ballot and value.
     *   3b. Because of 1-2, this will be the highest-seen commit ballot.  The replicas will
note that,
     *       and send it with subsequent promise replies.  This allows us to discard acceptance
records
     *       for successfully committed replicas, without allowing incomplete proposals to
commit erroneously
     *       later on.
     *
     *  Note that since we are performing a CAS rather than a simple update, we perform a
read (of committed
     *  values) between the prepare and accept phases.  This gives us a slightly longer window
for another
     *  coordinator to come along and trump our own promise with a newer one but is otherwise
safe.
     *
     * @return true if the operation succeeds in updating the row
     */
    public static boolean cas(String table, ByteBuffer key, ColumnFamily expected, ColumnFamily
updates)
    {
        // begin a paxos round
        UUID paxosBallot = UUIDGen.getTimeUUID();
        Token tk = StorageService.getPartitioner().getToken(key);
        List<InetAddress> naturalEndpoints = StorageService.instance.getNaturalEndpoints(table,
tk);
        Collection<InetAddress> pendingEndpoints = StorageService.instance.getTokenMetadata().pendingEndpointsFor(tk,
table);
        Iterable<InetAddress> allEndpoints = Iterables.concat(naturalEndpoints, pendingEndpoints);
        Map<InetAddress, PaxosPrepareResponse> inProgressProposals = preparePaxos(paxosBallot,
allEndpoints);

        // if we didn't hear from a majority of replicas, we have to bail for now
        int quorum = 1 + Iterables.size(allEndpoints) / 2;
        if (inProgressProposals.size() < quorum)
            throw new UnavailableException(ConsistencyLevel.SERIAL, quorum, inProgressProposals.size());

        // summarize the responses
        UUID mostRecentCommitted = UUIDGen.minTimeUUID(0);
        UUID inProgressBallot = UUIDGen.minTimeUUID(0);
        ColumnFamily inProgressUpdates;
        for (PaxosPrepareResponse response : inProgressProposals.values())
        {
            if (!response.promised)
            {
                // technically, we can proceed if we get a majority of successful promises,
                // but if a higher proposal number already exists even on one replica,
                // chances are it will be on more by the time we send a value to accept
                return false;
            }

            if (timeComparator.compare(response.mostRecentCommitted, mostRecentCommitted)
> 0)
                mostRecentCommitted = response.mostRecentCommitted;
            if (timeComparator.compare(response.inProgressBallot, inProgressBallot) > 0)
            {
                inProgressBallot = response.inProgressBallot;
                inProgressUpdates = response.inProgressUpdates;
            }
        }

        // complete earlier, in-progress rounds if necessary
        if (inProgressUpdates != null && timeComparator.compare(inProgressBallot,
mostRecentCommitted) >= 0)
        {
            RowMutation rm = new RowMutation(table, key, inProgressUpdates);
            if (acceptPaxos(paxosBallot, allEndpoints, rm))
                commitPaxos(paxosBallot, allEndpoints, rm);
            return false;
        }

        // read the current value
        ReadCommand readCommand = new SliceByNamesReadCommand(filter for expected);
        List<Row> rows = read(readCommand, ConsistencyLevel.QUORUM);

        // compare current value with expected
        if (expected does not match current)
            return false;

        // finish the paxos round w/ the desired updates
        RowMutation rm = new RowMutation(table, key, updates);
        RowMutation rm = new RowMutation(table, key, inProgressUpdates);
        if (acceptPaxos(paxosBallot, allEndpoints, rm))
        {
            commitPaxos(paxosBallot, allEndpoints, rm);
            return true;
        }
        return false;
    }

    private static class PaxosPrepareResponse
    {
        public final boolean promised;
        public final UUID mostRecentCommitted;
        public final UUID inProgressBallot;
        public final ColumnFamily inProgressUpdates;
    }
{code}
                
> Support CAS
> -----------
>
>                 Key: CASSANDRA-5062
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 2.0
>
>
> "Strong" consistency is not enough to prevent race conditions.  The classic example is
user account creation: we want to ensure usernames are unique, so we only want to signal account
creation success if nobody else has created the account yet.  But naive read-then-write allows
clients to race and both think they have a green light to create.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message