cassandra-user mailing list archives

From A J <>
Subject Re: New Chain for : Does Cassandra use vector clocks
Date Thu, 24 Feb 2011 19:16:35 GMT
>>but could be broken in case of a failed write<<
You can think of a scenario where R + W > N still leads to
inconsistency even for successful writes. Say you keep W=1 and R=N.
Let's say the one node where the write succeeded goes down
before the write made it to the other N-1 nodes. Let's say it goes down for good
and is unrecoverable. The only option is to build a new node from
scratch from the other active nodes. The result is a write that was
lost, and you will end up serving a stale copy of the data.

It is better to talk in terms of use cases and whether Cassandra is a
fit for them. Otherwise, unless you have W=R=N and fsync before each
write commit, there will be scope for inconsistency.
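
The overlap condition behind R + W > N can be checked by brute force. The sketch below is only an illustration and not Cassandra code: the function name always_overlap is made up here, and it simply enumerates every possible set of W acknowledging replicas and every possible set of R responding replicas out of N.

```python
from itertools import combinations

def always_overlap(n, w, r):
    """Return True if every set of w replicas shares a member with every set of r replicas."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )

n = 3
q = n // 2 + 1                    # quorum size for replication factor n
print(always_overlap(n, q, q))    # W=Q, R=Q
print(always_overlap(n, 1, n))    # W=1, R=N
print(always_overlap(n, 1, 1))    # W=1, R=1, so R + W <= N
```

The overlap only helps while the replica holding the write survives. With W=1, losing that single replica for good before it replicates removes the only copy, which is the scenario described above.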

On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <> wrote:
> I see the point - apologies for putting everyone through this!
> It was just militating against my mental model.
> In summary, here is my take away - simple stuff but - IMO - important to
> conclude this thread (I hope):-
> 1. I was splitting hairs over a failed (partial) Q write. Such an event
> should be immediately followed by the same write going over a connection to
> another node (potentially using connection caches of client implementations)
> or a read at CL of ALL, because a write could have partially gone through.
> 2. Timestamps are used in determining the latest version (correcting the
> false impression I was propagating).
> Finally, the "W + R > N for Q CL" statement holds, but could be broken in
> case of a failed write, as it is unsure whether the new value got written on
> any server or not. Is that a fair characterization?
> Bottom line - unlike a traditional DBMS, errors do not ensure automatic
> cleanup and rollback; app code has to follow up if immediate - and not
> eventual - consistency is desired. I made that leap in almost all cases - I
> think - except the case of a failed write.
> My bad and I can live with this!
> Regards,
> -JA
> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <>
> wrote:
>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <>
>> wrote:
>>> Completely understand!
>>> All that I am quibbling over is whether a CL of quorum guarantees
>>> consistency or not. That is what the documentation says, right? If, for a
>>> CL of Q read, the returned result depends on which node responds first
>>> or on other more convoluted conditions, then a quorum
>>> read/write is not consistent, by any definition.
>> But that's the point. The definition of consistency we are talking about
>> has no meaning if you consider only a quorum read. The definition (which is
>> the de facto definition of consistency in 'eventually consistent') makes
>> sense if we talk about a write followed by a read, and specifically
>> a successful write followed by a successful read.
>> And that is the statement the wiki is making.
>> Honestly, we could debate forever on the definition of consistency and
>> whatnot. Cassandra guarantees that if you do a (successful) write on W
>> replicas and then a (successful) read on R replicas, and if R+W>N, then it is
>> guaranteed that the read will see the preceding write. And this is what is
>> called consistency in the context of eventual consistency (which is not the
>> context of ACID).
>> If this is not the definition of consistency you had in mind then by all
>> means, Cassandra probably doesn't guarantee that definition. But given that the
>> paragraph preceding what you pasted states clearly that we are not talking about
>> ACID consistency, but eventual consistency, I don't think the wiki is making
>> any unfair statement.
>> That being said, the wiki may not always be as clear as it could be. But it's
>> an editable wiki :)
>> --
>> Sylvain
>>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>> this statement on the Wiki architecture section:-
>>> -------------------------------------------------------------
>>> More specifically: R=read replica count, W=write replica count,
>>> N=replication factor, Q=QUORUM (Q = N / 2 + 1)
>>> If W + R > N, you will have consistency
>>> W=1, R=N
>>> W=N, R=1
>>> W=Q, R=Q where Q = N / 2 + 1
>>> Cassandra provides consistency when R + W > N (read replica count + write
>>> replica count > replication factor).
>>> ----------------------------------------------------
>>> .
>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <>
>>> wrote:
>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <>
>>>> wrote:
>>>>> If you are correct - and you are probably closer to the code - then CL
>>>>> Quorum does not guarantee consistency.
>>>> If the operation succeeds, it does (for some definition of consistency,
>>>> namely that subsequent reads at quorum are guaranteed to see the new value
>>>> of an update at quorum). If it fails, then no, it does not guarantee
>>>> consistency.
>>>> It is important to note that the word consistency has multiple meanings.
>>>> In particular, when we are talking of consistency in Cassandra, we are not
>>>> talking of the same definition as the C in ACID
>>>> (see:
>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>> <> wrote:
>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <>
>>>>>> wrote:
>>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>>>>>> >> part of the application logic!!!
>>>>>>> >>What is your definition of conflict resolution ? Because if you
>>>>>>> >> update twice the same column (which
>>>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>>>> >> update wins (which I'll call a resolution).
>>>>>>> I understand what you are saying, and yes semantics is very important
>>>>>>> here. And yes we are responding to the immediate questions without
>>>>>>> all questions in the thread.
>>>>>>> The point being made here is that the timestamp of the column is not
>>>>>>> used by Cassandra to figure out what data to return.
>>>>>> Not quite true.
>>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>>> A quorum write comes in and adds/updates the timestamp (TS2) of a
>>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>>>>> returned as failed - right ?
>>>>>>> Now a quorum read comes in for exactly the same piece of data that the
>>>>>>> write failed for.
>>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>>>> I submit it will return TS1 - the old TS.
>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that make up the
>>>>>> quorum, then TS2 will be returned, because Cassandra will compare
>>>>>> timestamps and decide what to return based on this. If N2/N3 respond,
>>>>>> however, both timestamps will be TS1 and so, after timestamp resolution,
>>>>>> it will still be TS1 that is returned.
>>>>>> So yes, timestamps are used for conflict resolution.
>>>>>> In your example, you could get TS1 back because a failed write can leave
>>>>>> your cluster in an inconsistent state. You'd have to retry the quorum write;
>>>>>> only when it succeeds can you be guaranteed that a quorum read will
>>>>>> return TS2.
>>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>> the write did not make it in (there is no revert).
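
The N1/N2/N3 scenario discussed above can be replayed in a few lines. This is a toy model under stated assumptions, not Cassandra's read path: the replicas dictionary, the quorum_read helper, and the use of the integers 1 and 2 for TS1 and TS2 are all made up for illustration, and read repair and hinted handoff are ignored.

```python
from itertools import combinations

# Hypothetical replica state after a quorum write with timestamp 2 (TS2)
# reached N1 only and was reported to the client as failed.
replicas = {
    "N1": (2, "new"),
    "N2": (1, "old"),
    "N3": (1, "old"),
}

def quorum_read(responders):
    """Return the (timestamp, value) pair with the highest timestamp among the responders."""
    return max(replicas[name] for name in responders)

# With RF=3 and a quorum of 2, any pair of replicas may answer the read.
results = {pair: quorum_read(pair) for pair in combinations(sorted(replicas), 2)}
for pair, result in results.items():
    print(pair, "->", result)
```

Only the pair that excludes N1 returns the old value, so the outcome of the read after a failed write depends on which replicas happen to answer.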
>>>>>>> Are we on the same page with this interpretation ?
>>>>>>> Regards,
>>>>>>> -JA
>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>>> <> wrote:
>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>>>> <> wrote:
>>>>>>>>> Sylvain,
>>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>>>>> part of the application logic!!!
>>>>>>>> What is your definition of conflict resolution ? Because if you
>>>>>>>> update twice the same column (which
>>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd party
>>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>>> provides fairly strong durability guarantees, so for some definition you
>>>>>>>> don't "lose updates".
>>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>>>> guarantee. ACID relates to transactions, which Cassandra doesn't support. If
>>>>>>>> we're talking about the guarantees of transactions, then by all means,
>>>>>>>> Cassandra won't provide them. And yes, you can use cages or the like to get
>>>>>>>> transactions. But that was not the point of the thread, was it ? The thread
>>>>>>>> is about vector clocks, and that has nothing to do with transactions (vector
>>>>>>>> clocks certainly don't give you transactions).
>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>>>> so far I don't think vector clocks would really provide much for Cassandra.
>>>>>>>> --
>>>>>>>> Sylvain
>>>>>>>>> -JA
>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>>>>> <> wrote:
>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>>>>>> <> wrote:
>>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>> > From the other hand, the same article says:
>>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>>>>>>> > evaluated at all update
>>>>>>>>>>> > sites before the write can be allowed to
>>>>>>>>>>> >
>>>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>> Questions:-
>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>> No locking, no.
>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>>>>>> the resolution rules are the following:
>>>>>>>>>>   - If the updates have different timestamps, keep the one with
>>>>>>>>>> the higher timestamp. That is, the more recent of two updates wins.
>>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>>>>>>>> (byte comparison) and keeps the highest value. This is just to break ties in
>>>>>>>>>> a consistent manner.
>>>>>>>>>> So if you do two truly concurrent updates (that is, from two places
>>>>>>>>>> at the same instant), then you'll end up with one of the updates. This is at the
>>>>>>>>>> column level.
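
The two resolution rules above can be sketched as a small function. This is an illustration of the rules as stated in the message, not Cassandra's source: the name resolve and the (timestamp, value) tuple layout are assumptions, and Python's lexicographic bytes comparison stands in for the "byte comparison" whose exact ordering the thread does not spell out.

```python
def resolve(a, b):
    """Pick the winner of two (timestamp, value_bytes) updates to the same column."""
    ts_a, val_a = a
    ts_b, val_b = b
    if ts_a != ts_b:
        # Rule 1: the update with the higher timestamp wins.
        return a if ts_a > ts_b else b
    # Rule 2: equal timestamps, so break the tie by comparing the raw values.
    return a if val_a >= val_b else b

print(resolve((1, b"first"), (2, b"second")))
print(resolve((5, b"apple"), (5, b"banana")))
```

Because the outcome depends only on the two updates and not on arrival order, every replica that sees both updates picks the same winner.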
>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>>>> not good enough for some of your use cases and you need to keep two
>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the updates don't
>>>>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>>>>> identifier to the column name, for instance. And when reading, do a slice and
>>>>>>>>>> reconcile whatever you get back with whatever logic makes sense. If you do
>>>>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>>>>> locking or anything needed.
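
The "unique identifier in the column name, then slice and reconcile" approach described above might look roughly like this. It is a sketch against a plain dictionary standing in for one row, not a real Cassandra client: write_version, read_versions, reconcile, and the "base:uuid" naming scheme are all hypothetical, and the set-union merge is just one example of application-level reconciliation.

```python
import uuid

row = {}  # stand-in for one row: column name -> (timestamp, value)

def write_version(row, base_name, value, timestamp):
    """Store the update under a unique column name so concurrent writes never collide."""
    column = f"{base_name}:{uuid.uuid4().hex}"
    row[column] = (timestamp, value)
    return column

def read_versions(row, base_name):
    """Emulate a slice: return every version whose column name starts with base_name."""
    prefix = base_name + ":"
    return [version for name, version in sorted(row.items()) if name.startswith(prefix)]

def reconcile(versions):
    """Example application logic: merge the values of all versions into one set."""
    merged = set()
    for _timestamp, value in versions:
        merged |= set(value)
    return merged

# Two concurrent updates with the same timestamp both survive.
write_version(row, "items", ["book"], timestamp=100)
write_version(row, "items", ["pen"], timestamp=100)
print(reconcile(read_versions(row, "items")))
```

Since each update lands in its own column, timestamp resolution never discards one of them, and the reader decides how to combine them.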
>>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>>> enough. If the same user updates their profile picture twice on your web site
>>>>>>>>>> at the same microsecond, it's usually fine to end up with one of the two
>>>>>>>>>> pictures. In the rare case where you need something more specific, using the
>>>>>>>>>> Cassandra data model usually solves the problem easily. The reason for not
>>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't really found
>>>>>>>>>> many examples where that is not the case.
>>>>>>>>>> --
>>>>>>>>>> Sylvain
