incubator-cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@datastax.com>
Subject Re: clarification of the consistency guarantees of Counters
Date Tue, 31 May 2011 07:59:23 GMT
>I went through https://issues.apache.org/jira/browse/CASSANDRA-1072
>and realize that the consistency guarantees of Counters are a bit different from those
>of regular columns,

Not anymore.

> so could you please confirm that the following are true?
>1) comment https://issues.apache.org/jira/browse/CASSANDRA-1072?focusedCommentId=12900659&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12900659
>still holds : "there is no way to create a write CL greater than ONE, and thus, no defense
>against permanent failures of single machines"

As said above, it no longer holds. Apart from CL.ANY, all the
usual consistency levels are supported and give you the exact same
guarantees they give you with regular columns (including the
durability guarantee). As for CL.ANY, it may be possible to support it
in the future, but it would require extra code and may have a few
drawbacks.

>2) due to the above, the best I can achieve to increase reliability is to enable REPLICATE_ON_WRITE,
>but this would still expose the recent updates on the leader to being lost during a short
>interval
>3) without REPLICATE_ON_WRITE (or equivalently, read repair) I would have to do CL=ALL
>on read. then in this case, if the leader fails, all future reads fail. so for counters I
>have to enable
>REPLICATE_ON_WRITE or set read_repair chance to a reasonably high value, and do read CL != ALL.

One thing about REPLICATE_ON_WRITE: the counter ticket has had quite a
bit of history, and at first this option was not the default. It is
now, and you should *not* disable it unless you really know what you
are doing (and are aware that you could lose data and would need
CL.ALL on reads for consistency). I'd go as far as saying that this
option should probably not exist at all.

As for being exposed to losing recent updates during a short interval,
you aren't. Or at least not more than with regular columns. That is, we
honor the consistency level correctly and thus, as an end user,
you get the usual durability guarantees. The fact that we write to a
first replica before the other ones, instead of writing to all of them
in parallel, is an artifact of the implementation, and we could change
regular writes to do this too without changing any of the guarantees
provided.

>apart from the questions, some thoughts on Counters:
>the idea of distributed counters can be seen, in distributed algorithms terms, as a state
>machine (see Fred Schneider 93'),  where ideally we send the messages (delta increments) to
>each node, and the final state (sum of deltas, or the counter value) is deduced independently
>at each node.  in the current implementation, it's really not a distributed state machine,
>since state is deduced only at the leader, and what is replicated is just the final state.
>in fact, the data from different leaders are orthogonal, and within the data flow from one
>leader, it's really just a master-slave system. then we realize that this system is prone
>to single master failure.

Don't get fooled by the term 'leader': there is one leader *per
operation*, not one global leader. Again, the leader of an operation
is really just 'the first of the replicas we're replicating to'.

It's no more a master-slave design than regular writes are just
because they use a distinguished coordinator node for each operation.
And it's not prone to single node failure: if you do counter increments
at CL.QUORUM against, say, a cluster with RF=3, then you will still be
able to write and read even if one node is down, and which node exactly
doesn't matter at all.
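
To make the RF=3 case concrete, here is a minimal sketch of the quorum
arithmetic in Python. The function names and replica labels are made up
for illustration and are not Cassandra APIs; the only assumption is the
usual definition of a quorum as floor(RF / 2) + 1 replicas.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM operation."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """How many replicas can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


def quorum_available(replication_factor: int, down_replicas: set) -> bool:
    """True if enough replicas are still up to satisfy QUORUM."""
    live = replication_factor - len(down_replicas)
    return live >= quorum(replication_factor)


if __name__ == "__main__":
    rf = 3
    replicas = ["A", "B", "C"]  # hypothetical replica names

    print("quorum size:", quorum(rf))
    print("tolerated failures:", tolerated_failures(rf))

    # Whichever single replica is down, QUORUM can still be reached.
    for down in replicas:
        print("replica", down, "down, quorum available:",
              quorum_available(rf, {down}))

    # QUORUM writes and QUORUM reads overlap in at least one replica,
    # because write quorum + read quorum > RF.
    print("read/write overlap:", quorum(rf) + quorum(rf) > rf)
```

With RF=3 the quorum size is 2, so any one of the three replicas can be
down without blocking QUORUM reads or writes, and no replica plays a
special role in that calculation.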

--
Sylvain
