cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: New Chain for : Does Cassandra use vector clocks
Date Thu, 24 Feb 2011 22:20:18 GMT
On Thu, Feb 24, 2011 at 3:56 PM, A J <s5alye@gmail.com> wrote:
> While we are at it, there's more to consider than just CAP in distributed :)
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>
> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5alye@gmail.com> wrote:
>>> yes, that is difficult to digest and one has to be sure if the use
>>> case can afford it.
>>>
>>> Some other NoSQL databases deal with it differently (though I don't
>>> think any of them use atomic 2-phase commit). MongoDB for example will
>>> ask you to read from the node you wrote to first (primary node) unless
>>> you are ok with eventual consistency. If the write did not make it to a
>>> majority of other nodes, it will be rolled back from the original
>>> primary when it comes up again as a secondary.
>>> In some cases, you could still serve either the new value (that was
>>> returned as failed) or the old one. But it is different from Cassandra
>>> in the sense that Cassandra will never roll back.
>>>
>>>
>>>
>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <chirayithaj@gmail.com> wrote:
>>>> The leap of faith here is that an error does not mean a clean backing out
>>>> to prior state - as we are used to with databases. It means that the
>>>> operation in error could have gone through partially.
>>>>
>>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>>> -JA
>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5alye@gmail.com> wrote:
>>>>>
>>>>> >>but could be broken in case of a failed write<<
>>>>> You can think of a scenario where R + W > N still leads to
>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>> Let's say the one node where a write happened with success goes down
>>>>> before the write made it to the other N-1 nodes. Let's say it goes down
>>>>> for good and is unrecoverable. The only option is to build a new node
>>>>> from scratch from other active nodes. This will lead to a write that
>>>>> was lost and you will end up serving a stale copy of it.
>>>>>
>>>>> It is better to talk in terms of use cases and if cassandra will be a
>>>>> fit for it. Otherwise unless you have W=R=N and fsync before each
>>>>> write commit, there will be scope for inconsistency.
>>>>>
>>>>>
>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayithaj@gmail.com>
>>>>> wrote:
>>>>> > I see the point - apologies for putting everyone through this!
>>>>> > It was just militating against my mental model.
>>>>> > In summary, here is my takeaway - simple stuff but - IMO - important
>>>>> > to conclude this thread (I hope):
>>>>> > 1. I was splitting hairs over a failed (partial) Q write. Such an
>>>>> > event should be immediately followed by the same write going to a
>>>>> > connection to another node (potentially using connection caches of
>>>>> > client implementations) or a read at CL of ALL, because a write could
>>>>> > have partially gone through.
>>>>> > 2. Timestamps are used in determining the latest version (correcting
>>>>> > the false impression I was propagating).
>>>>> > Finally, wrt "W + R > N for Q CL": the statement holds, but could be
>>>>> > broken in case of a failed write, as it is unsure whether the new
>>>>> > value got written on any server or not. Is that a fair
>>>>> > characterization?
>>>>> > Bottom line - unlike a traditional DBMS, errors do not ensure automatic
>>>>> > cleanup and revert; app code has to follow up if immediate - and not
>>>>> > eventual - consistency is desired. I made that leap in almost all
>>>>> > cases - I think - except the case of a failed write.
>>>>> > My bad and I can live with this!
>>>>> > Regards,
>>>>> > -JA
>>>>> >
>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>> > <sylvain@datastax.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayithaj@gmail.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Completely understand!
>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>> >>> consistency or not. That is what the documentation says - right? If,
>>>>> >>> for a CL of Q read, it depends on which node returns the read first
>>>>> >>> to determine the actual returned result, or on other more convoluted
>>>>> >>> conditions, then a Quorum read/write is not consistent, by any
>>>>> >>> definition.
>>>>> >>
>>>>> >> But that's the point. The definition of consistency we are talking
>>>>> >> about has no meaning if you consider only a quorum read. The
>>>>> >> definition (which is the de facto definition of consistency in
>>>>> >> 'eventually consistent') makes sense if we talk about a write
>>>>> >> followed by a read. And it is considering a succeeding write followed
>>>>> >> by a succeeding read.
>>>>> >> And that is the statement the wiki is making.
>>>>> >> Honestly, we could debate forever on the definition of consistency
>>>>> >> and whatnot. Cassandra guarantees that if you do a (succeeding) write
>>>>> >> on W replicas and then a (succeeding) read on R replicas and if
>>>>> >> R+W>N, then it is guaranteed that the read will see the preceding
>>>>> >> write. And this is what is called consistency in the context of
>>>>> >> eventual consistency (which is not the context of ACID).
>>>>> >> If this is not the definition of consistency you had in mind then by
>>>>> >> all means, Cassandra probably doesn't guarantee that definition. But
>>>>> >> given that the paragraph preceding what you pasted states clearly we
>>>>> >> are not talking about ACID consistency, but eventual consistency, I
>>>>> >> don't think the wiki is making any unfair statement.
>>>>> >> That being said, the wiki may not always be as clear as it could be.
>>>>> >> But it's an editable wiki :)
>>>>> >> --
>>>>> >> Sylvain
>>>>> >>
>>>>> >>>
>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us
>>>>> >>> not make this statement on the Wiki architecture section:-
>>>>> >>> -------------------------------------------------------------
>>>>> >>>
>>>>> >>> More specifically: R=read replica count, W=write replica count,
>>>>> >>> N=replication factor, Q=QUORUM (Q = N / 2 + 1)
>>>>> >>>
>>>>> >>> If W + R > N, you will have consistency
>>>>> >>>
>>>>> >>> W=1, R=N
>>>>> >>> W=N, R=1
>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>> >>>
>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>> >>> + write replica count > replication factor).
>>>>> >>>
>>>>> >>> ----------------------------------------------------
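
The W + R > N rule in the wiki excerpt above can be checked by brute force. This is only a sketch under simplifying assumptions: replicas are plain integers, every write and read succeeds on exactly W and R of them, and the names quorum and always_overlap are made up for illustration (they are not Cassandra APIs).

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1 with integer division, as in the wiki excerpt."""
    return n // 2 + 1


def always_overlap(w: int, r: int, n: int) -> bool:
    """True if every set of W written replicas shares at least one
    replica with every set of R replicas that answer a read."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # The three combinations from the wiki, plus W=1/R=1 as a counterexample.
    for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
        print(f"N={n} W={w} R={r}: W+R>N is {w + r > n}, "
              f"overlap guaranteed: {always_overlap(w, r, n)}")
```

The overlap is what lets a successful read see a preceding successful write: at least one replica in the read set holds the new value. It says nothing about failed writes, which is the case discussed in this thread.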
>>>>> >>>
>>>>> >>> .
>>>>> >>>
>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>> >>> <sylvain@datastax.com>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <chirayithaj@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> If you are correct and you are probably closer to the code - then
>>>>> >>>>> CL of Quorum does not guarantee consistency.
>>>>> >>>>
>>>>> >>>> If the operation succeeds, it does (for some definition of
>>>>> >>>> consistency, which is: following reads at Quorum will be guaranteed
>>>>> >>>> to see the new value of an update at quorum). If it fails, then no,
>>>>> >>>> it does not guarantee consistency.
>>>>> >>>> It is important to note that the word consistency has multiple
>>>>> >>>> meanings. In particular, when we are talking of consistency in
>>>>> >>>> Cassandra, we are not talking of the same definition as the C in
>>>>> >>>> ACID
>>>>> >>>>
>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>> >>>>>
>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>> >>>>> <sylvain@datastax.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>> >>>>>> <chirayithaj@gmail.com>
>>>>> >>>>>> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it
>>>>> >>>>>>>> >>is part of the application logic!!!
>>>>> >>>>>>>
>>>>> >>>>>>> >>What is your definition of conflict resolution? Because if you
>>>>> >>>>>>> >>update twice the same column (which I'll call a conflict), then
>>>>> >>>>>>> >>the timestamps are used to decide which update wins (which I'll
>>>>> >>>>>>> >>call a resolution).
>>>>> >>>>>>> I understand what you are saying, and yes semantics is very
>>>>> >>>>>>> important here. And yes we are responding to the immediate
>>>>> >>>>>>> questions without covering all questions in the thread.
>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>> >>>>>>> not used by Cassandra to figure out what data to return.
>>>>> >>>>>>
>>>>> >>>>>> Not quite true.
>>>>> >>>>>>>
>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>> >>>>>>> A Quorum write comes in and adds/updates the time stamp (TS2) of
>>>>> >>>>>>> a particular data element. It succeeds on N1 - fails on N2/3. So
>>>>> >>>>>>> the write is returned as failed - right?
>>>>> >>>>>>> Now a Quorum read comes in for exactly the same piece of data that
>>>>> >>>>>>> the write failed for.
>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>> >>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>> >>>>>>
>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two
>>>>> >>>>>> that make the quorum, then TS2 will be returned, because Cassandra
>>>>> >>>>>> will compare the timestamps and decide what to return based on
>>>>> >>>>>> this. If N2/N3 respond however, both timestamps will be TS1 and so,
>>>>> >>>>>> after timestamp resolution, it will still be TS1 that will be
>>>>> >>>>>> returned.
>>>>> >>>>>> So yes, the timestamp is used for conflict resolution.
>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>> >>>>>> leave your cluster in an inconsistent state. You'd have to retry
>>>>> >>>>>> the quorum write and only when it succeeds can you be guaranteed
>>>>> >>>>>> that a quorum read will always return TS2.
>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>>> >>>>>> that the write did not make it in (there is no revert).
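
The N1/N2/N3 scenario above can be enumerated in a few lines. This is a toy model, not Cassandra code: the replicas dictionary, the integer timestamps 1 and 2 standing in for TS1 and TS2, and the quorum_read helper are all assumptions made for illustration.

```python
from itertools import combinations

# State after the failed quorum write: only N1 took the new value.
# Each replica holds (timestamp, value); 2 stands for TS2, 1 for TS1.
replicas = {
    "N1": (2, "new"),
    "N2": (1, "old"),
    "N3": (1, "old"),
}


def quorum_read(responders):
    """Return the (timestamp, value) with the highest timestamp among
    the replicas that answered the read."""
    return max((replicas[name] for name in responders), key=lambda tv: tv[0])


# With RF=3 and a quorum of 2, any two replicas may answer first.
results = {
    pair: quorum_read(pair)[1]
    for pair in combinations(sorted(replicas), 2)
}

if __name__ == "__main__":
    for pair, value in results.items():
        print(pair, "->", value)
```

Two of the three possible responder pairs include N1 and return the new value, while the N2/N3 pair returns the old one, which is why the outcome after a failed write is not deterministic until the write is retried successfully.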
>>>>> >>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> Are we on the same page with this interpretation?
>>>>> >>>>>>> Regards,
>>>>> >>>>>>> -JA
>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>> >>>>>>> <sylvain@datastax.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>> >>>>>>>> <chirayithaj@gmail.com> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Sylvain,
>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>> >>>>>>>>> part of the application logic!!!
>>>>> >>>>>>>>
>>>>> >>>>>>>> What is your definition of conflict resolution? Because if you
>>>>> >>>>>>>> update twice the same column (which I'll call a conflict), then
>>>>> >>>>>>>> the timestamps are used to decide which update wins (which I'll
>>>>> >>>>>>>> call a resolution).
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd
>>>>> >>>>>>>>> party products - cages for e.g. - to get ACID type consistency.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>> >>>>>>>> Cassandra provides a fairly strong durability guarantee, so for
>>>>> >>>>>>>> some definition you don't "lose updates".
>>>>> >>>>>>>> That being said, I never claimed that Cassandra provided any
>>>>> >>>>>>>> ACID guarantee. ACID relates to transactions, which Cassandra
>>>>> >>>>>>>> doesn't support. If we're talking about the guarantees of
>>>>> >>>>>>>> transactions, then by all means, Cassandra won't provide them.
>>>>> >>>>>>>> And yes you can use cages or the like to get transactions. But
>>>>> >>>>>>>> that was not the point of the thread, was it? The thread is about
>>>>> >>>>>>>> vector clocks, and that has nothing to do with transactions
>>>>> >>>>>>>> (vector clocks certainly don't give you transactions).
>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>> >>>>>>>> why so far I don't think vector clocks would really provide much
>>>>> >>>>>>>> for Cassandra.
>>>>> >>>>>>>> --
>>>>> >>>>>>>> Sylvain
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> -JA
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>> >>>>>>>>> <sylvain@datastax.com> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>> >>>>>>>>>> <chirayithaj@gmail.com> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>> >>>>>>>>>>> > evaluated at all update sites before the write can be
>>>>> >>>>>>>>>>> > allowed to succeed."
>>>>> >>>>>>>>>>> >
>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>> >>>>>>>>>>> > used
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>> >>>>>>>>>>> Questions:-
>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> No locking, no.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>> >>>>>>>>>>> conflicts? Concurrent updates on exactly the same piece of
>>>>> >>>>>>>>>>> data on different nodes can still mess each other up, right ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>> >>>>>>>>>> updating the same piece of data means the same column value. In
>>>>> >>>>>>>>>> that case, the resolution rules are the following:
>>>>> >>>>>>>>>>   - If the updates have different timestamps, keep the one with
>>>>> >>>>>>>>>>     the higher timestamp. That is, the more recent of two
>>>>> >>>>>>>>>>     updates wins.
>>>>> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>>> >>>>>>>>>>     (byte comparison) and keeps the highest value. This is just
>>>>> >>>>>>>>>>     to break ties in a consistent manner.
>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is, from two
>>>>> >>>>>>>>>> places at the same instant), then you'll end up with one of the
>>>>> >>>>>>>>>> updates. This is at the column level.
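
A minimal sketch of the two resolution rules just described, assuming a simplified in-memory column. The Column type and the reconcile function are hypothetical names used for illustration; this is not Cassandra's actual implementation.

```python
from typing import NamedTuple


class Column(NamedTuple):
    name: str
    value: bytes
    timestamp: int  # client-supplied, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    # Rule 1: different timestamps, the higher (more recent) one wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Rule 2: equal timestamps, break the tie by byte comparison of the
    # values and keep the highest, so every replica picks the same winner.
    return a if a.value >= b.value else b


if __name__ == "__main__":
    older = Column("picture", b"cat.png", 1000)
    newer = Column("picture", b"dog.png", 2000)
    print(reconcile(older, newer).value)

    tie_a = Column("picture", b"aaa", 3000)
    tie_b = Column("picture", b"bbb", 3000)
    print(reconcile(tie_a, tie_b).value)
```

Because the result does not depend on argument order, any two replicas comparing the same pair of versions converge on the same column, with no locking involved.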
>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>> >>>>>>>>>> is not good enough for some of your use cases and you need to
>>>>> >>>>>>>>>> keep two concurrent updates, it is easy enough. Just make sure
>>>>> >>>>>>>>>> that the updates don't end up in the same column. This is
>>>>> >>>>>>>>>> easily achieved by appending some unique identifier to the
>>>>> >>>>>>>>>> column name for instance. And when reading, do a slice and
>>>>> >>>>>>>>>> reconcile whatever you get back with whatever logic makes
>>>>> >>>>>>>>>> sense. If you do that, congrats, you've roughly emulated what
>>>>> >>>>>>>>>> vector clocks would do. Btw, no locking or anything needed.
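
The "unique identifier in the column name" idea can be sketched without a cluster. Everything here is an assumption made for illustration: the row is a plain dict standing in for one Cassandra row, the "base:uuid" naming scheme is one possible convention, and write_sibling / read_siblings are hypothetical helpers, not client library calls.

```python
import uuid

# Stand-in for a single row: column name -> (value, timestamp).
row = {}


def write_sibling(row, base_name, value, timestamp):
    """Store the update under a unique column name so that concurrent
    writers never overwrite each other."""
    column_name = f"{base_name}:{uuid.uuid4().hex}"
    row[column_name] = (value, timestamp)
    return column_name


def read_siblings(row, base_name):
    """Emulate a slice over all columns sharing the base name."""
    prefix = base_name + ":"
    return [cell for name, cell in sorted(row.items())
            if name.startswith(prefix)]


# Two "concurrent" updates with the same timestamp both survive.
write_sibling(row, "cart", {"apple"}, 100)
write_sibling(row, "cart", {"pear"}, 100)

# Application-level reconciliation: here, a set union of all siblings.
merged = set().union(*(value for value, _ in read_siblings(row, "cart")))

if __name__ == "__main__":
    print(sorted(merged))
```

The merge rule (a set union here) is the application's choice; the point is only that both updates are still available at read time.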
>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>> >>>>>>>>>> enough. If the same user updates their profile picture twice on
>>>>> >>>>>>>>>> your web site at the same microsecond, it's usually fine to end
>>>>> >>>>>>>>>> up with one of the two pictures. In the rare case where you
>>>>> >>>>>>>>>> need something more specific, using the Cassandra data model
>>>>> >>>>>>>>>> usually solves the problem easily. The reason for not having
>>>>> >>>>>>>>>> vector clocks in Cassandra is that so far, we haven't really
>>>>> >>>>>>>>>> found many examples where it is not the case.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> --
>>>>> >>>>>>>>>> Sylvain
>>>>> >>>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>> >
>>>>
>>>>
>>>
>>
>>
>> Just to make a note: the "EVENTUAL" in eventual consistency could be a
>> time that is less than 1 ms.
>>
>> I have a program that demonstrates what "eventual" means: if I write
>> data at the weakest level and read it back from another random node
>> as soon as possible, 99% of the time I see the update. I can share the
>> code if you would like.
>>
>> Remember http://en.wikipedia.org/wiki/Spacetime
>> ...but there is no reference frame in which the two events can occur
>> at the same time...
>>
>> As to the MongoDB references ... Yes! Most of the NoSQL stores work
>> differently. They each approach CAP
>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>> different way.
>>
>> Cassandra does not lock (it is no secret). But remember, you cannot
>> have it all: pick 2 of 3 from CAP.
>>
>

http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
I was reading that and many of the points were well taken...up until...

Next generation DBMS technologies, such as VoltDB, have been shown to
run around 50X the speed of conventional SQL engines.  Thus, if you
need 200 nodes to support a specific SQL application, then VoltDB can
probably do the same application on 4 nodes.  The probability of a
failure on 200 nodes is wildly different than the probability of
failure on four nodes.

Come on, 200 nodes down to 4? I just cannot take it seriously any more.
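
For reference, the probability part of the quoted claim is simple arithmetic once a node count is accepted. The sketch below assumes node failures are independent and uses a made-up per-node failure probability (0.01 over some fixed period); neither assumption comes from the article, and p_any_failure is a name chosen for illustration.

```python
def p_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` independent nodes fails,
    given each fails with probability `p_node` in the period considered."""
    return 1.0 - (1.0 - p_node) ** nodes


if __name__ == "__main__":
    p = 0.01  # hypothetical per-node failure probability
    for nodes in (200, 4):
        print(f"{nodes} nodes: {p_any_failure(nodes, p):.3f}")
```

The gap between the two numbers is real under these assumptions, but it only matters if the 200-to-4 reduction itself holds, which is the part that is hard to take seriously.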
