cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4285) Atomic, eventually-consistent batches
Date Thu, 05 Jul 2012 13:51:35 GMT


Sylvain Lebresne commented on CASSANDRA-4285:

If I understand correctly, only the coordinator of a given batch would be able to replay
it. The problem I see with that is that if the node dies and you never "replace it"
(i.e. bring a node with the same IP back up), then some batches might never be replayed, which
puts a strong burden on the operator not to screw up. Besides, the batches won't be replayed
until a replacement node is brought up, which means that even if we do replay them ultimately,
it can take an unbounded amount of time.

So I would also add a mechanism to allow other nodes to replay batches. For instance, when
a node A detects that another node B is down, it could check whether it has some batches for
B locally and replay them (node B will replay them too when it's back up, but that doesn't
matter since the writes are idempotent).
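
Something along these lines (a rough sketch only; Batch, BatchlogStore and ReplayOnConviction
are made-up names, not actual Cassandra classes):

{code:java}
import java.net.InetAddress;
import java.util.List;

// Sketch: when the failure detector convicts node B, any node A that holds
// batchlog entries written on behalf of B replays them itself instead of
// waiting for B to be replaced.
interface Batch { List<Runnable> mutations(); }

interface BatchlogStore {
    List<Batch> batchesCoordinatedBy(InetAddress node);
    void remove(Batch batch);
}

class ReplayOnConviction {
    private final BatchlogStore store;

    ReplayOnConviction(BatchlogStore store) { this.store = store; }

    // Called by the failure detector when 'deadNode' is marked down.
    void onConvicted(InetAddress deadNode) {
        for (Batch batch : store.batchesCoordinatedBy(deadNode)) {
            batch.mutations().forEach(Runnable::run); // writes are idempotent,
            store.remove(batch);                      // so a double replay
        }                                             // (here and on B's
    }                                                 // restart) is harmless
}
{code}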

bq. we need to retry the read indefinitely in case another replica recovered

For that too, we can use the failure detector to track which nodes we've successfully checked
since restart (which avoids the "indefinitely" part).
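
Roughly, the bookkeeping I have in mind (sketch only; ReplicaCheckTracker and its method
names are made up):

{code:java}
import java.net.InetAddress;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: bound the "retry the read indefinitely" loop by remembering which
// replicas have been successfully checked since this node restarted.
class ReplicaCheckTracker {
    private final Set<InetAddress> checkedSinceRestart = ConcurrentHashMap.newKeySet();

    void markChecked(InetAddress replica) {
        checkedSinceRestart.add(replica);
    }

    // Once every replica has answered at least once since restart, no
    // recovered replica can still hold data we haven't seen: stop retrying.
    boolean allChecked(Set<InetAddress> replicas) {
        return checkedSinceRestart.containsAll(replicas);
    }
}
{code}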

bq. default RF will be 1; operators can increase if desired

I'll admit I find 1 a bit too low for a default (especially given it'll be global) and
I would prefer at least 2. My reasoning is that:
# RF=1 is a tad unsafe as far as durability is concerned.
# With RF=1, the one replica you've picked might time out. Even if you automatically
retry on another shard (which I'm not in favor of, see below), it will screw up the latency.
RF > 1 (with CL.ONE) largely mitigates that issue (sketched below).
# A higher RF won't be slower during the writes (it will actually be faster because of my
preceding point), and the writes are really what we care about. If replay is a bit slower because
of it, that's not a big deal (especially given that there will never be much to replay).
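
To make point 2 concrete (sketch only; BatchlogWriter and writeBatchlog are made-up names,
and the ack futures are assumed to come from the messaging layer): with RF=2 and CL.ONE the
write succeeds on the first replica response, so one slow or dead replica doesn't turn into
a client-visible timeout.

{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of the CL.ONE behavior with RF=2: each future completes when that
// replica acknowledges the batchlog write.
class BatchlogWriter {
    CompletableFuture<Void> writeBatchlog(List<CompletableFuture<Void>> replicaAcks) {
        // CL.ONE semantics: succeed on the first ack, regardless of whether
        // the other replica is slow or down.
        return CompletableFuture.anyOf(replicaAcks.toArray(new CompletableFuture[0]))
                                .thenAccept(ignored -> {});
    }
}
{code}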

bq. Part of the goal here is to avoid forcing the client to retry on TimedOutException. So
if we attempt a batchlog write that times out, we should also retry to another shard instead
of propagating TOE to the client.

I think that what this ticket will provide is an extension of the atomicity that exists for
batches to the same key to all batches, and I don't think it gives us much more than that.
So I fully expect the retry policy for clients to be unchanged (most of the time, client applications
want to retry because what they care about is achieving a given consistency level, or because
they care that the data is replicated to at least X nodes).

In other words, I see a timeout as saying "I haven't been able to achieve the requested consistency
level in time". This ticket doesn't change that; it only makes a stronger guarantee about the state
of the DB in that case (which is good). But I don't see why that would make us start doing
retries server-side.

bq. we shouldn't have to make the client retry for timeouts writing to the replicas either;
we can do the retry server-side

Same as above, I disagree :).

bq. Instead, we should introduce a new exception (InProgressException?) to indicate that the
data isn't available to read yet

As said above, I think that this should still be a TimeoutException. However, I do see a point
in giving more info on what that timeout means, and I've opened CASSANDRA-4414 for that
(which I had been meaning to do for some time anyway). Having successfully written to the DCL could
just be one of the pieces of information we add to the TimeoutException.
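
For illustration, the kind of information that could be attached (a sketch only;
EnrichedTimeoutException and its fields are made up, not an actual API):

{code:java}
// Sketch: keep throwing a timeout, but enrich it with what we do know.
// 'writtenToBatchlog' is one example of information worth attaching: it tells
// the client the batch is durable and will eventually be replayed.
class EnrichedTimeoutException extends Exception {
    final int acksReceived;          // how many replicas acknowledged in time
    final int acksRequired;          // what the requested CL needed
    final boolean writtenToBatchlog; // batch is durable and will be replayed

    EnrichedTimeoutException(int acksReceived, int acksRequired, boolean writtenToBatchlog) {
        super(String.format("Timed out: %d/%d acks (batchlog written: %b)",
                            acksReceived, acksRequired, writtenToBatchlog));
        this.acksReceived = acksReceived;
        this.acksRequired = acksRequired;
        this.writtenToBatchlog = writtenToBatchlog;
    }
}
{code}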

> Atomic, eventually-consistent batches
> -------------------------------------
>                 Key: CASSANDRA-4285
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
> I discussed this in the context of triggers (CASSANDRA-1311) but it's useful as a standalone
> feature as well.


