incubator-cassandra-user mailing list archives

From A J <s5a...@gmail.com>
Subject Re: New Chain for : Does Cassandra use vector clocks
Date Fri, 25 Feb 2011 15:08:05 GMT
He has a product to sell, so you can expect some advertising. But in
general, Stonebraker's articles are very deep (another one that
challenges common conceptions is
http://voltdb.com/voltdb-webinar-sql-urban-myths ). He is the creator
of Postgres and is considered a database guru by many.
And actually, if you cannot let go of ACID and are not satisfied with
traditional DBMS solutions, VoltDB is worth considering. It of course
solves a different problem (OLTP) than Cassandra does.


On Thu, Feb 24, 2011 at 5:20 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
> On Thu, Feb 24, 2011 at 3:56 PM, A J <s5alye@gmail.com> wrote:
>> While we are at it, there's more to consider than just CAP in distributed :)
>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>
>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5alye@gmail.com> wrote:
>>>> yes, that is difficult to digest and one has to be sure if the use
>>>> case can afford it.
>>>>
>>>> Some other NoSQL databases deal with it differently (though I don't
>>>> think any of them use atomic 2-phase commit). MongoDB for example will
>>>> ask you to read from the node you wrote to first (primary node) unless
>>>> you are ok with eventual consistency. If the write did not make it to
>>>> a majority of the other nodes, it will be rolled back from the original
>>>> primary when it comes up again as a secondary.
>>>> In some cases, you could still serve either the new value (that was
>>>> returned as failed) or the old one. But it is different from Cassandra
>>>> in the sense that Cassandra will never roll back.
>>>>
>>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <chirayithaj@gmail.com> wrote:
>>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>>> prior state - as we are used to with databases. It means that the operation
>>>>> in error could have gone through partially.
>>>>>
>>>>> Again, this is not absolutely unfamiliar territory and can be dealt with.
>>>>> -JA
>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5alye@gmail.com> wrote:
>>>>>>
>>>>>> >>but could be broken in case of a failed write<<
>>>>>> You can think of a scenario where R + W > N still leads to
>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>>> Let's say the one node where a write succeeded goes down
>>>>>> before the write made it to the other N-1 nodes. Let's say it goes down
>>>>>> for good and is unrecoverable. The only option is to build a new node from
>>>>>> scratch from other active nodes. This will lead to a write that was
>>>>>> lost and you will end up serving a stale copy of it.
>>>>>>
>>>>>> It is better to talk in terms of use cases and whether Cassandra will be a
>>>>>> fit for them. Otherwise, unless you have W=R=N and fsync before each
>>>>>> write commit, there will be scope for inconsistency.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <chirayithaj@gmail.com>
>>>>>> wrote:
>>>>>> > I see the point - apologies for putting everyone through this!
>>>>>> > It was just militating against my mental model.
>>>>>> > In summary, here is my takeaway - simple stuff but - IMO - important to
>>>>>> > conclude this thread (I hope):-
>>>>>> > 1. I was splitting hairs over a failed (partial) Q write. Such an event
>>>>>> > should be immediately followed by the same write going to a connection
>>>>>> > to another node (potentially using connection caches of client
>>>>>> > implementations) or a read at CL of ALL, because a write could have
>>>>>> > partially gone through.
>>>>>> > 2. Timestamps are used in determining the latest version (correcting
>>>>>> > the false impression I was propagating).
>>>>>> > Finally, wrt "W + R > N for Q CL": the statement holds, but could be broken
>>>>>> > in case of a failed write, as it is unsure whether the new value got written
>>>>>> > on any server or not. Is that a fair characterization ?
>>>>>> > Bottom line - unlike a traditional DBMS, errors do not ensure automatic
>>>>>> > cleanup and reverting back; app code has to follow up if immediate - and
>>>>>> > not eventual - consistency is desired. I made that leap in almost all cases
>>>>>> > - I think - but the case of a failed write.
>>>>>> > My bad and I can live with this!
>>>>>> > Regards,
>>>>>> > -JA
>>>>>> >
>>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>> > <sylvain@datastax.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <chirayithaj@gmail.com>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Completely understand!
>>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>>> >>> consistency or not. That is what the documentation says - right? If,
>>>>>> >>> for a CL of Q read, it depends on which node returns the read first to
>>>>>> >>> determine the actual returned result, or on other more convoluted
>>>>>> >>> conditions, then a Quorum read/write is not consistent, by any definition.
>>>>>> >>
>>>>>> >> But that's the point. The definition of consistency we are talking
>>>>>> >> about has no meaning if you consider only a quorum read. The definition
>>>>>> >> (which is the de facto definition of consistency in 'eventually consistent')
>>>>>> >> makes sense if we talk about a write followed by a read. And it is
>>>>>> >> considering a succeeding write followed by a succeeding read.
>>>>>> >> And that is the statement the wiki is making.
>>>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>>>> >> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>>>>>> >> replicas and then a (succeeding) read on R replicas and if R+W>N, then it
>>>>>> >> is guaranteed that the read will see the preceding write. And this is what
>>>>>> >> is called consistency in the context of eventual consistency (which is not
>>>>>> >> the context of ACID).
>>>>>> >> If this is not the definition of consistency you had in mind then by
>>>>>> >> all means, Cassandra probably doesn't guarantee that definition. But given
>>>>>> >> that the paragraph preceding what you pasted states clearly we are not
>>>>>> >> talking about ACID consistency, but eventual consistency, I don't think the
>>>>>> >> wiki is making any unfair statement.
>>>>>> >> That being said, the wiki may not always be as clear as it could. But
>>>>>> >> it's an editable wiki :)
>>>>>> >> --
>>>>>> >> Sylvain
>>>>>> >>
>>>>>> >>>
>>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>>> >>> make this statement on the Wiki architecture section:-
>>>>>> >>> -------------------------------------------------------------
>>>>>> >>>
>>>>>> >>> More specifically: R=read replica count W=write replica
>>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>>> >>>
>>>>>> >>> If W + R > N, you will have consistency
>>>>>> >>>
>>>>>> >>> W=1, R=N
>>>>>> >>> W=N, R=1
>>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>>> >>>
>>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>>> >>> + write replica count > replication factor).
>>>>>> >>>
>>>>>> >>> ----------------------------------------------------
>>>>>> >>>
>>>>>> >>> .
>>>>>> >>>
>>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>> >>> <sylvain@datastax.com>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <chirayithaj@gmail.com>
>>>>>> >>>> wrote:
>>>>>> >>>>>
>>>>>> >>>>> If you are correct - and you are probably closer to the code - then CL
>>>>>> >>>>> of Quorum does not guarantee consistency.
>>>>>> >>>>
>>>>>> >>>> If the operation succeeds, it does (for some definition of consistency
>>>>>> >>>> which is: following reads at Quorum will be guaranteed to see the new
>>>>>> >>>> value of an update at quorum). If it fails, then no, it does not guarantee
>>>>>> >>>> consistency.
>>>>>> >>>> It is important to note that the word consistency has multiple
>>>>>> >>>> meanings. In particular, when we are talking of consistency in Cassandra,
>>>>>> >>>> we are not talking of the same definition as the C in ACID
>>>>>> >>>>
>>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>> >>>>>
>>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>>> >>>>> <sylvain@datastax.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>>> >>>>>> <chirayithaj@gmail.com>
>>>>>> >>>>>> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>>>> >>>>>>>> >> part of the application logic!!!
>>>>>> >>>>>>>
>>>>>> >>>>>>> >>What is your definition of conflict resolution ? Because if you
>>>>>> >>>>>>> >> update the same column twice (which
>>>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
>>>>>> >>>>>>> >> which update wins (which I'll call a resolution).
>>>>>> >>>>>>> I understand what you are saying, and yes semantics is very
>>>>>> >>>>>>> important here. And yes we are responding to the immediate questions
>>>>>> >>>>>>> without covering all questions in the thread.
>>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>>> >>>>>>> not used by Cassandra to figure out what data to return.
>>>>>> >>>>>>
>>>>>> >>>>>> Not quite true.
>>>>>> >>>>>>>
>>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>> >>>>>>> A Quorum write comes in and adds/updates the timestamp (TS2) of a
>>>>>> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>>> >>>>>>> write is returned as failed - right ?
>>>>>> >>>>>>> Now a Quorum read comes in for exactly the same piece of data that
>>>>>> >>>>>>> the write failed for.
>>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>> >>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>>> >>>>>>
>>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>>> >>>>>> make the quorum, then TS2 will be returned, because Cassandra will
>>>>>> >>>>>> compare the timestamps and decide what to return based on this. If N2/N3
>>>>>> >>>>>> respond however, both timestamps will be TS1 and so, after timestamp
>>>>>> >>>>>> resolution, it will still be TS1 that will be returned.
>>>>>> >>>>>> So yes timestamp is used for conflict resolution.
>>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>>> >>>>>> leave your cluster in an inconsistent state. You'd have to retry the
>>>>>> >>>>>> quorum write and only when it succeeds can you be guaranteed that a
>>>>>> >>>>>> quorum read will always return TS2.
>>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>>>> >>>>>> that the write did not make it in (there is no revert).
>>>>>> >>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>>> >>>>>>> Regards,
>>>>>> >>>>>>> -JA
>>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>> >>>>>>> <sylvain@datastax.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>> >>>>>>>> <chirayithaj@gmail.com> wrote:
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> Sylvan,
>>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>> >>>>>>>>> part of the application logic!!!
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> What is your definition of conflict resolution ? Because if you
>>>>>> >>>>>>>> update the same column twice (which I'll call a conflict), then the
>>>>>> >>>>>>>> timestamps are used to decide which update wins (which I'll call a
>>>>>> >>>>>>>> resolution).
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>>>> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>>> >>>>>>>> Cassandra provides a fairly strong durability guarantee, so for some
>>>>>> >>>>>>>> definition you don't "lose updates".
>>>>>> >>>>>>>> That being said, I never pretended that Cassandra provided any
>>>>>> >>>>>>>> ACID guarantee. ACID relates to transactions, which Cassandra doesn't
>>>>>> >>>>>>>> support. If we're talking about the guarantees of transactions, then
>>>>>> >>>>>>>> by all means, Cassandra won't provide them. And yes you can use cages
>>>>>> >>>>>>>> or the like to get transactions. But that was not the point of the
>>>>>> >>>>>>>> thread, was it ? The thread is about vector clocks, and that has
>>>>>> >>>>>>>> nothing to do with transactions (vector clocks certainly don't give
>>>>>> >>>>>>>> you transactions).
>>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>>> >>>>>>>> why so far I don't think vector clocks would really provide much for
>>>>>> >>>>>>>> Cassandra.
>>>>>> >>>>>>>> --
>>>>>> >>>>>>>> Sylvain
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> -JA
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>> >>>>>>>>> <sylvain@datastax.com> wrote:
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>> >>>>>>>>>> <chirayithaj@gmail.com> wrote:
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> > On the other hand, the same article says:
>>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>> >>>>>>>>>>> > evaluated at all update
>>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>> >>>>>>>>>>> >
>>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>>> >>>>>>>>>>> > used
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>> >>>>>>>>>>> Questions:-
>>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> No locking, no.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>> >>>>>>>>>>> conflicts? Concurrent updates on exactly the same piece of
>>>>>> >>>>>>>>>>> data on different nodes can still mess each other up, right ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>> >>>>>>>>>> updating the same piece of data means the same column value. In
>>>>>> >>>>>>>>>> that case, the resolution rules are the following:
>>>>>> >>>>>>>>>>   - If the updates have different timestamps, keep the one
>>>>>> >>>>>>>>>>     with the higher timestamp. That is, the more recent of two
>>>>>> >>>>>>>>>>     updates wins.
>>>>>> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>>>> >>>>>>>>>>     (byte comparison) and keeps the highest value. This is just to
>>>>>> >>>>>>>>>>     break ties in a consistent manner.
>>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is, from two
>>>>>> >>>>>>>>>> places at the same instant), then you'll end up with one of the
>>>>>> >>>>>>>>>> updates. This is at the column level.
>>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>>> >>>>>>>>>> is not good enough for some of your use cases and you need to keep
>>>>>> >>>>>>>>>> two concurrent updates, it is easy enough. Just make sure that the
>>>>>> >>>>>>>>>> updates don't end up in the same column. This is easily achieved by
>>>>>> >>>>>>>>>> appending some unique identifier to the column name for instance.
>>>>>> >>>>>>>>>> And when reading, do a slice and reconcile whatever you get back
>>>>>> >>>>>>>>>> with whatever logic makes sense. If you do that, congrats, you've
>>>>>> >>>>>>>>>> roughly emulated what vector clocks would do. Btw, no locking or
>>>>>> >>>>>>>>>> anything needed.
>>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>> >>>>>>>>>> enough. If the same user updates their profile picture twice on
>>>>>> >>>>>>>>>> your web site at the same microsecond, it's usually fine to end up
>>>>>> >>>>>>>>>> with one of the two pictures. In the rare case where you need
>>>>>> >>>>>>>>>> something more specific, using the Cassandra data model usually
>>>>>> >>>>>>>>>> solves the problem easily. The reason for not having vector clocks
>>>>>> >>>>>>>>>> in Cassandra is that so far, we haven't really found many examples
>>>>>> >>>>>>>>>> where that is not the case.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> --
>>>>>> >>>>>>>>>> Sylvain
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>>
>>> Just to make a note, the "EVENTUAL" in eventual consistency could be a
>>> time that is less than 1ms.
>>>
>>> I have a program that demonstrates that "eventual" means: if I write
>>> data at the weakest level and read it back from another random node
>>> as soon as possible, 99% of the time I see the update. I can share the
>>> code if you would like.
>>>
>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>> ...but there is no reference frame in which the two events can occur
>>> at the same time...
>>>
>>> As to MongoDB references ....Yes! Most of the NoSQL stores work differently.
>>> They each approach CAP
>>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>>> different way.
>>>
>>> Cassandra does not lock (it is no secret). But remember, you cannot
>>> have it all: pick 2 of 3 from CAP.
>>>
>>
>
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
> I was reading that and many of the points were well taken...up until...
>
> Next generation DBMS technologies, such as VoltDB, have been shown to
> run around 50X the speed of conventional SQL engines.  Thus, if you
> need 200 nodes to support a specific SQL application, then VoltDB can
> probably do the same application on 4 nodes.  The probability of a
> failure on 200 nodes is wildly different than the probability of
> failure on four nodes.
>
> Come on? 200 nodes down to 4? I just can not take it seriously any more.
>

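For anyone who wants to play with the failed-write scenario discussed in the
quoted thread (RF=3, quorum of 2, TS2 landed only on N1), here is a minimal
Python sketch. It is a toy model, not Cassandra code: the names Column,
reconcile, quorum and possible_reads are made up for illustration, and the
replicas are just a list of in-memory values. It applies the resolution rule
described in the thread (higher timestamp wins, ties broken by byte comparison
of the values) and enumerates what a read could return depending on which
replicas answer.

```python
from itertools import combinations
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical stand-in for one replica's copy of a column."""
    timestamp: int
    value: bytes


def reconcile(a: Column, b: Column) -> Column:
    """Higher timestamp wins; on equal timestamps the higher value (byte order) wins."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return a if a.value >= b.value else b


def quorum(n: int) -> int:
    """Q = N / 2 + 1 with integer division, as in the wiki text quoted in the thread."""
    return n // 2 + 1


def possible_reads(replicas: list, r: int) -> set:
    """Every value a read at level r could return, over all choices of r responders."""
    results = set()
    for responders in combinations(replicas, r):
        winner = responders[0]
        for col in responders[1:]:
            winner = reconcile(winner, col)
        results.add(winner.value)
    return results


old = Column(timestamp=1, value=b"old")   # TS1
new = Column(timestamp=2, value=b"new")   # TS2

# Quorum write that failed: only N1 holds TS2, N2 and N3 still hold TS1.
after_failed_write = [new, old, old]
# Quorum write that succeeded: at least two replicas hold TS2.
after_successful_write = [new, new, old]

if __name__ == "__main__":
    q = quorum(3)
    print("failed write, quorum read may return:",
          sorted(possible_reads(after_failed_write, q)))
    print("successful write, quorum read may return:",
          sorted(possible_reads(after_successful_write, q)))
```

In this model the read after the failed write can come back with either value,
depending on whether N1 is among the responders, while the read after the
successful quorum write can only see the new value, since any two replicas
overlap with the two that were written (R + W > N).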