cassandra-user mailing list archives

From Jeremy Hanna <>
Subject Re: New Chain for : Does Cassandra use vector clocks
Date Fri, 25 Feb 2011 16:46:55 GMT
Yeah - no worries - I don't think anyone was thinking you were drinking the kool-aid or
selling anything.  Jonathan was just pointing out thoughtful replies to Stonebraker's claims.

This past year, Michael Stonebraker, with voltdb and other things, seems to have tried to take
advantage of the momentum behind systems like cassandra (as well as the backlash against nosql)
to make pretty bold claims, especially considering that volt is an in-memory database.
So 1) he's kind of been using his pedigree as credibility in selling a new product and 2)
the voltdb marketing department makes heavy use of buzzwords and hyperbole.

Nothing wrong with voltdb necessarily, it probably has its uses.  However, the way it's been
pitched by the company and by Stonebraker in particular seems disingenuous, self-serving,
and to me has very much tarnished his reputation as an objective luminary in the field of
computer science.

Maybe I'm taking that too far, but now every time I hear a statement by him, I have a grain
of salt at the ready.

On Feb 25, 2011, at 10:21 AM, A J wrote:

> Though you are not really implying that, I am not selling anything. I
> don't work for VoltDB. I had other issues for my use case with the
> software when I was evaluating it (their claim of durability is weak,
> in my view. Though it does not matter, I'd rather they call
> themselves NOSQL; they just give lip service to SQL).
> I'd rather not drink any sort of kool-aid, get all sides (whatever the
> motives of the sides may be) and be the judge myself of what I want to do.
> The thread was started by someone who seems to be having difficulty wrapping
> their head around the gives and takes of cassandra. Maybe something else is
> better for their use case.
> Peace :)
> On Fri, Feb 25, 2011 at 10:39 AM, Jonathan Ellis <> wrote:
>> That article is heavily biased by "I am selling a competitor to Cassandra."
>> First, read Coda's original piece if you haven't:
>> Then, Jeff Darcy's response:
>> On Thu, Feb 24, 2011 at 2:56 PM, A J <> wrote:
>>> While we are at it, there's more to consider than just CAP in distributed :)
>>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <>
>>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <> wrote:
>>>>> Yes, that is difficult to digest and one has to be sure the use
>>>>> case can afford it.
>>>>> Some other NOSQL databases deal with it differently (though I don't
>>>>> think any of them use atomic 2-phase commit). MongoDB, for example, will
>>>>> ask you to read from the node you wrote to first (the primary node) unless
>>>>> you are ok with eventual consistency. If the write did not make it to a
>>>>> majority of the other nodes, it will be rolled back from the original
>>>>> primary when it comes up again as a secondary.
>>>>> In some cases, you could still serve either the new value (whose write was
>>>>> returned as failed) or the old one. But it is different from Cassandra
>>>>> in the sense that Cassandra will never roll back.
>>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <>
>>>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>>>> the prior state - as we are used to with databases. It means that the write
>>>>>> in error could have gone through partially.
>>>>>> Again, this is not absolutely unfamiliar territory and can be dealt with.
>>>>>> -JA
>>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <> wrote:
>>>>>>>>> but could be broken in case of a failed write<<
>>>>>>> You can think of a scenario where R + W > N still leads to
>>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>>>> Let's say the one node where a write succeeded goes down
>>>>>>> before the write made it to the other N-1 nodes. Let's say it goes down for good
>>>>>>> and is unrecoverable. The only option is to build a new node from
>>>>>>> scratch from the other active nodes. This will lead to a write that is
>>>>>>> lost, and you will end up serving a stale copy of the data.
>>>>>>> It is better to talk in terms of use cases and whether cassandra will be a
>>>>>>> fit for them. Otherwise, unless you have W=R=N and fsync before
>>>>>>> write commit, there will be scope for inconsistency.
>>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <>
>>>>>>> wrote:
>>>>>>>> I see the point - apologies for putting everyone through this.
>>>>>>>> It was just militating against my mental model.
>>>>>>>> In summary, here is my takeaway - simple stuff but, IMO, important to
>>>>>>>> conclude this thread (I hope):
>>>>>>>> 1. I was splitting hairs over a failed (partial) Q write. Such an event
>>>>>>>> should be immediately followed by the same write going over a connection
>>>>>>>> to another node (potentially using the connection caches of client
>>>>>>>> implementations) or by a read at CL of ALL, because the write could have
>>>>>>>> partially gone through.
>>>>>>>> 2. Timestamps are used in determining the latest version (correcting the
>>>>>>>> false impression I was propagating).
>>>>>>>> Finally, wrt the "W + R > N for Q CL" statement: it holds, but could be broken
>>>>>>>> in case of a failed write, as it is unsure whether the new value got written
>>>>>>>> on any server or not. Is that a fair characterization?
>>>>>>>> Bottom line - unlike a traditional DBMS, errors do not ensure
>>>>>>>> cleanup and a revert; app code has to follow up if immediate - and
>>>>>>>> not eventual - consistency is desired. I made that leap in almost all cases,
>>>>>>>> I think, except the case of a failed write.
>>>>>>>> My bad and I can live with this!
>>>>>>>> Regards,
>>>>>>>> -JA
>>>>>>>> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>>>> <>
>>>>>>>> wrote:
>>>>>>>>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <>
>>>>>>>>> wrote:
>>>>>>>>>> Completely understand!
>>>>>>>>>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>>>>>>> consistency or not. That is what the documentation says, right? If,
>>>>>>>>>> for a CL of Q read, the actual returned result depends on which node
>>>>>>>>>> returns the read first, or on other more convoluted conditions, then a
>>>>>>>>>> Quorum read/write is not consistent, by any definition.
>>>>>>>>> But that's the point. The definition of consistency we are talking about
>>>>>>>>> has no meaning if you consider only a quorum read. The definition (which is
>>>>>>>>> the de facto definition of consistency in 'eventually consistent') makes
>>>>>>>>> sense only if we talk about a write followed by a read. And it is
>>>>>>>>> considering a successful write followed by a successful read.
>>>>>>>>> And that is the statement the wiki is making.
>>>>>>>>> Honestly, we could debate forever on the definition of consistency and
>>>>>>>>> whatnot. Cassandra guarantees that if you do a (successful) write on W
>>>>>>>>> replicas and then a (successful) read on R replicas, and if R+W>N, then
>>>>>>>>> the read will see the preceding write. And this is what is
>>>>>>>>> called consistency in the context of eventual consistency (which is not the
>>>>>>>>> context of ACID).
>>>>>>>>> If this is not the definition of consistency you had in mind, then by all
>>>>>>>>> means, Cassandra probably doesn't guarantee that definition. But given that the
>>>>>>>>> paragraph preceding what you pasted states clearly that we are not talking about
>>>>>>>>> ACID consistency, but eventual consistency, I don't think the wiki is making
>>>>>>>>> any unfair statement.
>>>>>>>>> That being said, the wiki may not always be as clear as it could be. But it's
>>>>>>>>> an editable wiki :)
>>>>>>>>> --
>>>>>>>>> Sylvain
>>>>>>>>>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>>>>>>>>> this statement in the Wiki architecture section:
>>>>>>>>>> -------------------------------------------------------------
>>>>>>>>>> More specifically: R=read replica count W=write replica
>>>>>>>>>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>>>>>>> If W + R > N, you will have consistency
>>>>>>>>>> W=1, R=N
>>>>>>>>>> W=N, R=1
>>>>>>>>>> W=Q, R=Q where Q = N / 2 + 1
>>>>>>>>>> Cassandra provides consistency when R + W > N (read replica count
>>>>>>>>>> + write replica count > replication factor).
>>>>>>>>>> ----------------------------------------------------
>>>>>>>>>> .
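The overlap argument behind the quoted wiki statement can be checked by brute force. What follows is a minimal Python sketch, not Cassandra code: the helper names `quorum`, `overlap_guaranteed`, and `always_intersect` are hypothetical, and replicas are modelled as plain integers. It enumerates every possible write set of size W and read set of size R and checks that the two always share at least one replica exactly when W + R > N. As the rest of the thread stresses, this says something only about writes and reads that succeed.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1 with integer division, as in the wiki excerpt."""
    return n // 2 + 1


def overlap_guaranteed(w: int, r: int, n: int) -> bool:
    """The rule from the wiki excerpt: W + R > N."""
    return w + r > n


def always_intersect(w: int, r: int, n: int) -> bool:
    """Brute force: does every write set of size w share a replica
    with every read set of size r, out of n replicas?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


N = 3
Q = quorum(N)
configs = {
    "W=1, R=N": (1, N),
    "W=N, R=1": (N, 1),
    "W=Q, R=Q": (Q, Q),
    "W=1, R=1": (1, 1),  # weakest levels, for contrast
}
for label, (w, r) in configs.items():
    verdict = "overlap guaranteed" if always_intersect(w, r, N) else "no guarantee"
    print(f"{label}: {verdict}")
```

With N=3, the three configurations listed in the excerpt all report a guaranteed overlap, while W=1, R=1 does not.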
>>>>>>>>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>>>>>> <>
>>>>>>>>>> wrote:
>>>>>>>>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John
>>>>>>>>>>> wrote:
>>>>>>>>>>>> If you are correct, and you are probably closer to the code, then CL of
>>>>>>>>>>>> Quorum does not guarantee consistency.
>>>>>>>>>>> If the operation succeeds, it does (for some definition of consistency,
>>>>>>>>>>> which is: following reads at Quorum will be guaranteed to see the new value
>>>>>>>>>>> of an update at quorum). If it fails, then no, it does not guarantee
>>>>>>>>>>> consistency.
>>>>>>>>>>> It is important to note that the word consistency has multiple meanings.
>>>>>>>>>>> In particular, when we are talking of consistency in Cassandra, we are not
>>>>>>>>>>> talking of the same definition as the C in ACID
>>>>>>>>>>> (see:
>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain
>>>>>>>>>>>> <> wrote:
>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony
>>>>>>>>>>>>> <>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>>>>>>>>>>>>> part of the application logic!!!
>>>>>>>>>>>>>>>> What is your definition of conflict resolution? Because if you
>>>>>>>>>>>>>>>> update the same column twice (which
>>>>>>>>>>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>>>>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>>>>>>> I understand what you are saying, and yes, semantics are very important
>>>>>>>>>>>>>> here. And yes, we are responding to the immediate questions without covering
>>>>>>>>>>>>>> all questions in the thread.
>>>>>>>>>>>>>> The point being made here is that the timestamp of the column is not
>>>>>>>>>>>>>> used by Cassandra to figure out what data to return.
>>>>>>>>>>>>> Not quite true.
>>>>>>>>>>>>>> E.g. - Quorum is 2 nodes, with an RF of 3 over N1/2/3.
>>>>>>>>>>>>>> A Quorum write comes in and adds/updates the timestamp (TS2) of a
>>>>>>>>>>>>>> particular data element. It succeeds on N1 and fails on N2/3. So the write is
>>>>>>>>>>>>>> returned as failed - right?
>>>>>>>>>>>>>> Now a Quorum read comes in for exactly the same piece of data that the
>>>>>>>>>>>>>> write failed for.
>>>>>>>>>>>>>> So N1 has TS2, but both N2/3 have the old TS (say TS1).
>>>>>>>>>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>>>>>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>>>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>>>>>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is one of the two that make up the
>>>>>>>>>>>>> quorum, then TS2 will be returned, because cassandra will compare the
>>>>>>>>>>>>> timestamps and decide what to return based on them. If N2 and N3 respond,
>>>>>>>>>>>>> however, both timestamps will be TS1 and so, after timestamp resolution, it
>>>>>>>>>>>>> will still be TS1 that is returned.
>>>>>>>>>>>>> So yes, the timestamp is used for conflict resolution.
>>>>>>>>>>>>> In your example, you could get TS1 back because a failed write can leave
>>>>>>>>>>>>> your cluster in an inconsistent state. You'd have to retry the
>>>>>>>>>>>>> quorum write, and only when it succeeds can you be guaranteed that a quorum read will
>>>>>>>>>>>>> always return TS2.
>>>>>>>>>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>>>>>>>>> the write did not make it in (there is no revert).
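The N1/N2/N3 example above can be played out in a few lines. This is a toy Python model under stated assumptions, not Cassandra's read path: replicas are a dict of hypothetical (timestamp, value) pairs, `quorum_read` and `all_quorum_reads` are made-up helper names, and a quorum read is taken to return the highest-timestamp cell among whichever two replicas answer.

```python
from itertools import combinations


def quorum_read(replicas, responders):
    """Return the (timestamp, value) cell with the highest timestamp
    among the replicas that responded."""
    return max((replicas[name] for name in responders), key=lambda cell: cell[0])


def all_quorum_reads(replicas, quorum_size=2):
    """Result of a quorum read for every possible set of responders."""
    return {
        pair: quorum_read(replicas, pair)
        for pair in combinations(sorted(replicas), quorum_size)
    }


TS1, TS2 = 1, 2

# State after the failed quorum write: TS2 landed on N1 only, nothing was reverted.
after_failed_write = {"N1": (TS2, "new"), "N2": (TS1, "old"), "N3": (TS1, "old")}
failed = all_quorum_reads(after_failed_write)
for pair, cell in failed.items():
    print("after failed write:", pair, "->", cell)

# The client retries and this time the write reaches a quorum (N1 and N2).
after_retry = dict(after_failed_write, N2=(TS2, "new"))
retried = all_quorum_reads(after_retry)
for pair, cell in retried.items():
    print("after successful retry:", pair, "->", cell)
```

After the failed write, a read answered by N2 and N3 still sees TS1, while any read that includes N1 sees TS2. Once the retry reaches two replicas, every pair of responders includes at least one copy of TS2.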
>>>>>>>>>>>>>> Are we on the same page with this
interpretation ?
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>>>>>>>>>> <> wrote:
>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>>>>>>>>>>> <>
>>>>>>>>>>>>>>>> Sylvain,
>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>>>>>>>>>>>> part of the application logic!!!
>>>>>>>>>>>>>>> What is your definition of conflict resolution? Because if you
>>>>>>>>>>>>>>> update the same column twice (which
>>>>>>>>>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>>>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>>>>>>>>>>>>>> products - cages, for example - to get ACID-type consistency.
>>>>>>>>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>>>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>>>>>>>>>> provides a fairly strong durability guarantee, so for some definition you
>>>>>>>>>>>>>>> don't "lose updates".
>>>>>>>>>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>>>>>>>>>>> guarantee. ACID relates to transactions, which Cassandra doesn't support. If
>>>>>>>>>>>>>>> we're talking about the guarantees of transactions, then by all means,
>>>>>>>>>>>>>>> cassandra won't provide them. And yes, you can use cages or the like to get
>>>>>>>>>>>>>>> transactions. But that was not the point of the thread, was it? The thread
>>>>>>>>>>>>>>> is about vector clocks, and that has nothing to do with transactions (vector
>>>>>>>>>>>>>>> clocks certainly don't give you transactions).
>>>>>>>>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why,
>>>>>>>>>>>>>>> so far, I don't think vector clocks would really provide much for
>>>>>>>>>>>>>>> Cassandra.
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>>>>>>>>>>>> <>
>>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>>>>>>>>>>>>> <>
>>>>>>>>>>>>>>>>>> Apologies: for some reason my response on the original mail
>>>>>>>>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>>>>>>>>>> On the other hand, the same article says:
>>>>>>>>>>>>>>>>>>> "For conditional writes to work, the condition must be
>>>>>>>>>>>>>>>>>>> evaluated at all update
>>>>>>>>>>>>>>>>>>> sites before the write can be allowed to succeed."
>>>>>>>>>>>>>>>>>>> This means that when doing such an update, CL=ALL must be
>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>>>>>>>>> Questions:
>>>>>>>>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>>>>>>>>> granularity, whether it be row/colF/Col?
>>>>>>>>>>>>>>>>> No locking, no.
>>>>>>>>>>>>>>>>>> 2. If the answer to 1 above is NO, how does CL ALL prevent
>>>>>>>>>>>>>>>>>> conflicts? Concurrent updates on exactly the same piece of
>>>>>>>>>>>>>>>>>> data on different
>>>>>>>>>>>>>>>>>> nodes can still mess each other up, right?
>>>>>>>>>>>>>>>>> Not sure why you are taking CL.ALL specifically. But at any CL,
>>>>>>>>>>>>>>>>> updating the same piece of data means updating the same column value. In
>>>>>>>>>>>>>>>>> that case, the resolution rules are the following:
>>>>>>>>>>>>>>>>>   - If the updates have different timestamps, keep the one with
>>>>>>>>>>>>>>>>> the higher timestamp. That is, the more recent of the two updates wins.
>>>>>>>>>>>>>>>>>   - If the timestamps are the same, then compare the values
>>>>>>>>>>>>>>>>> (byte comparison) and keep the highest value. This is just to break ties in
>>>>>>>>>>>>>>>>> a consistent manner.
>>>>>>>>>>>>>>>>> So if you do two truly concurrent updates (that is, from two places
>>>>>>>>>>>>>>>>> at the same instant), then you'll end up with one of the updates. This is at the
>>>>>>>>>>>>>>>>> column level.
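The two resolution rules above can be written down as a small function. This is a minimal Python sketch, not Cassandra's implementation: the `Column` type and `resolve` function are hypothetical names, and Python's lexicographic comparison of `bytes` stands in for the byte comparison mentioned above, which is an assumption about the exact ordering.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical stand-in for one version of a column."""
    timestamp: int
    value: bytes


def resolve(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    # Rule 1: different timestamps, so the higher (more recent) one wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Rule 2: same timestamp, so break the tie by comparing the values as bytes.
    return a if a.value >= b.value else b


older = Column(timestamp=100, value=b"zebra.png")
newer = Column(timestamp=200, value=b"apple.png")
print(resolve(older, newer))  # the timestamp decides, whatever the values are

tie_a = Column(timestamp=300, value=b"apple.png")
tie_b = Column(timestamp=300, value=b"pear.png")
print(resolve(tie_a, tie_b))  # same timestamp, the higher byte value is kept
```

Because the result does not depend on the order of the arguments, every replica that sees the same two versions ends up keeping the same one, which is the point of breaking ties "in a consistent manner".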
>>>>>>>>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>>>>>>>>>>> not good enough for some of your use cases and you need to keep two
>>>>>>>>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the updates don't
>>>>>>>>>>>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>>>>>>>>>>>> identifier to the column name, for instance. And when reading, do a slice and
>>>>>>>>>>>>>>>>> reconcile whatever you get back with whatever logic makes sense. If you do
>>>>>>>>>>>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>>>>>>>>>>>> locking or anything needed.
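A rough illustration of the "unique identifier in the column name" idea, as a Python sketch under loud assumptions: the row is modelled as a plain dict rather than a real Cassandra row, no client library is involved, and `write_version`, `slice_versions`, and `reconcile` are hypothetical helpers. The merge rule used here (union of the items) is only one example of "whatever logic makes sense".

```python
import uuid

# Stand-in for one row: column name -> value.
row = {}


def write_version(row, base_name, value, writer_id=None):
    """Store the value under a column name made unique per writer,
    so concurrent updates never land in the same column."""
    suffix = writer_id or uuid.uuid4().hex
    name = f"{base_name}:{suffix}"
    row[name] = value
    return name


def slice_versions(row, base_name):
    """Emulate a slice: return every value whose column name starts with the base name."""
    prefix = base_name + ":"
    return [row[name] for name in sorted(row) if name.startswith(prefix)]


def reconcile(versions):
    """Application-level merge; here, the union of all items seen."""
    merged = set()
    for version in versions:
        merged |= set(version)
    return sorted(merged)


# Two clients update the same logical "cart" at the same time.
write_version(row, "cart", ["book"], writer_id="client-a")
write_version(row, "cart", ["pen"], writer_id="client-b")

merged = reconcile(slice_versions(row, "cart"))
print(merged)  # ['book', 'pen']
```

Both updates survive because they live in different columns, and the reader decides how to combine them; no lock is taken at any point.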
>>>>>>>>>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>>>>>>>>>> enough. If the same user updates their profile picture twice on your web site
>>>>>>>>>>>>>>>>> at the same microsecond, it's usually fine to end up with one of the two
>>>>>>>>>>>>>>>>> pictures. In the rare case where you need something more specific, using the
>>>>>>>>>>>>>>>>> cassandra data model usually solves the problem easily. The reason for not
>>>>>>>>>>>>>>>>> having vector clocks in Cassandra is that, so far, we haven't really found
>>>>>>>>>>>>>>>>> many examples where that is not the case.
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Sylvain
>>>> Just to make a note: the "EVENTUAL" in eventual consistency could be a
>>>> time that is less than 1ms.
>>>> I have a program that demonstrates what "eventual" means: if I write
>>>> data at the weakest level and read it back from another random node
>>>> as soon as possible, 99% of the time I see the update. I can share the code if you
>>>> would like.
>>>> Remember
>>>> ...but there is no reference frame in which the two events can occur
>>>> at the same time...
>>>> As to the MongoDB references... Yes! Most of the noSQL systems work differently.
>>>> They each approach CAP in a different way.
>>>> Cassandra does not lock (it is no secret). But remember, you cannot
>>>> have it all; pick 2 of 3 from CAP.
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
