cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anthony John <chirayit...@gmail.com>
Subject Re: How does Cassandra handle failure during synchronous writes
Date Thu, 24 Feb 2011 02:05:56 GMT
>Remember the simple rule. Column with highest timestamp is the one that
will be considered correct EVENTUALLY. So consider following case:

I am sorry, that will return inconsistent results even a Q. Time stamp have
nothing to do with this. It is just an application provided artifact and
could be anything.

>c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
was written to node1 will be returned.

In this case - N1 will be identified as a discrepancy and the change will be
discarded via read repair

On Wed, Feb 23, 2011 at 6:47 PM, Narendra Sharma
<narendra.sharma@gmail.com>wrote:

> Remember the simple rule. Column with highest timestamp is the one that
> will be considered correct EVENTUALLY. So consider following case:
>
> Cluster size = 3 (say node1, node2 and node3), RF = 3, Read/Write CL =
> QUORUM
> a. QUORUM in this case requires 2 nodes. Write failed with successful write
> to only 1 node say node1.
> b. Read with CL = QUORUM. If read hits node2 and node3, old data will be
> returned with read repair triggered in background. On next read you will get
> the data that was written to node1.
> c. Read with CL = QUORUM. If read hits node1 and node2/node3, new data that
> was written to node1 will be returned.
>
> HTH!
>
> Thanks,
> Naren
>
>
>
> On Wed, Feb 23, 2011 at 3:36 PM, Ritesh Tijoriwala <
> tijoriwala.ritesh@gmail.com> wrote:
>
>> Hi Anthony,
>> I am not talking about the case of CL ANY. I am talking about the case
>> where your consistency level is  R + W > N and you want to write to W nodes
>> but only succeed in writing to X ( where X < W) nodes and hence fail the
>> write to the client.
>>
>> thanks,
>> Ritesh
>>
>> On Wed, Feb 23, 2011 at 2:48 PM, Anthony John <chirayithaj@gmail.com>wrote:
>>
>>> Ritesh,
>>>
>>> At CL ANY - if all endpoints are down - a HH is written. And it is a
>>> successful write - not a failed write.
>>>
>>> Now that does not guarantee a READ of the value just written - but that
>>> is a risk that you take when you use the ANY CL!
>>>
>>> HTH,
>>>
>>> -JA
>>>
>>>
>>> On Wed, Feb 23, 2011 at 4:40 PM, Ritesh Tijoriwala <
>>> tijoriwala.ritesh@gmail.com> wrote:
>>>
>>>> hi Anthony,
>>>> While you stated the facts right, I don't see how it relates to the
>>>> question I ask. Can you elaborate specifically what happens in the case I
>>>> mentioned above to Dave?
>>>>
>>>> thanks,
>>>> Ritesh
>>>>
>>>>
>>>> On Wed, Feb 23, 2011 at 1:57 PM, Anthony John <chirayithaj@gmail.com>wrote:
>>>>
>>>>> Seems to me that the explanations are getting incredibly complicated
-
>>>>> while I submit the real issue is not!
>>>>>
>>>>> Salient points here:-
>>>>> 1. To be guaranteed data consistency - the writes and reads have to be
>>>>> at Quorum CL or more
>>>>> 2. Any W/R at lesser CL means that the application has to handle the
>>>>> inconsistency, or has to be tolerant of it
>>>>> 3. Writing at "ANY" CL - a special case - means that writes will always
>>>>> go through (as long as any node is up), even if the destination nodes
are
>>>>> not up. This is done via hinted handoff. But this can result in inconsistent
>>>>> reads, and yes that is a problem but refer to pt-2 above
>>>>> 4. At QUORUM CL R/W - after Quorum is met, hinted handoffs are used to
>>>>> handle that case where a particular node is down and the write needs
to be
>>>>> replicated to it. But this will not cause inconsistent R as the hinted
>>>>> handoff (in this case) only applies after Quorum is met - so a Quorum
R is
>>>>> not dependent on the down node being up, and having got the hint.
>>>>>
>>>>> Hope I state this appropriately!
>>>>>
>>>>> HTH,
>>>>>
>>>>> -JA
>>>>>
>>>>>
>>>>> On Wed, Feb 23, 2011 at 3:39 PM, Ritesh Tijoriwala <
>>>>> tijoriwala.ritesh@gmail.com> wrote:
>>>>>
>>>>>> > Read repair will probably occur at that point (depending on
your
>>>>>> config), which would cause the newest value to propagate to more
replicas.
>>>>>>
>>>>>> Is the newest value the "quorum" value which means it is the old
value
>>>>>> that will be written back to the nodes having "newer non-quorum"
value or
>>>>>> the newest value is the real new value? :) If later, than this seems
kind of
>>>>>> odd to me and how it will be useful to any application. A bug?
>>>>>>
>>>>>> Thanks,
>>>>>> Ritesh
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 23, 2011 at 12:43 PM, Dave Revell <dave@meebo-inc.com>wrote:
>>>>>>
>>>>>>> Ritesh,
>>>>>>>
>>>>>>> You have seen the problem. Clients may read the newly written
value
>>>>>>> even though the client performing the write saw it as a failure.
When the
>>>>>>> client reads, it will use the correct number of replicas for
the chosen CL,
>>>>>>> then return the newest value seen at any replica. This "newest
value" could
>>>>>>> be the result of a failed write.
>>>>>>>
>>>>>>> Read repair will probably occur at that point (depending on your
>>>>>>> config), which would cause the newest value to propagate to more
replicas.
>>>>>>>
>>>>>>> R+W>N guarantees serial order of operations: any read at CL=R
that
>>>>>>> occurs after a write at CL=W will observe the write. I don't
think this
>>>>>>> property is relevant to your current question, though.
>>>>>>>
>>>>>>> Cassandra has no mechanism to "roll back" the partial write,
other
>>>>>>> than to simply write again. This may also fail.
>>>>>>>
>>>>>>> Best,
>>>>>>> Dave
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 23, 2011 at 10:12 AM, <tijoriwala.ritesh@gmail.com>wrote:
>>>>>>>
>>>>>>>> Hi Dave,
>>>>>>>> Thanks for your input. In the steps you mention, what happens
when
>>>>>>>> client tries to read the value at step 6? Is it possible
that the client may
>>>>>>>> see the new value? My understanding was if R + W > N,
then client will not
>>>>>>>> see the new value as Quorum nodes will not agree on the new
value. If that
>>>>>>>> is the case, then its alright to return failure to the client.
However, if
>>>>>>>> not, then it is difficult to program as after every failure,
you as an
>>>>>>>> client are not sure if failure is a pseudo failure with some
side effects or
>>>>>>>> real failure.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ritesh
>>>>>>>>
>>>>>>>> <quote author='Dave Revell'>
>>>>>>>>
>>>>>>>> Ritesh,
>>>>>>>>
>>>>>>>> There is no commit protocol. Writes may be persisted on some
>>>>>>>> replicas even
>>>>>>>> though the quorum fails. Here's a sequence of events that
shows the
>>>>>>>> "problem:"
>>>>>>>>
>>>>>>>> 1. Some replica R fails, but recently, so its failure has
not yet
>>>>>>>> been
>>>>>>>> detected
>>>>>>>> 2. A client writes with consistency > 1
>>>>>>>> 3. The write goes to all replicas, all replicas except R
persist the
>>>>>>>> write
>>>>>>>> to disk
>>>>>>>> 4. Replica R never responds
>>>>>>>> 5. Failure is returned to the client, but the new value is
still in
>>>>>>>> the
>>>>>>>> cluster, on all replicas except R.
>>>>>>>>
>>>>>>>> Something very similar could happen for CL QUORUM.
>>>>>>>>
>>>>>>>> This is a conscious design decision because a commit protocol
would
>>>>>>>> constitute tight coupling between nodes, which goes against
the
>>>>>>>> Cassandra
>>>>>>>> philosophy. But unfortunately you do have to write your app
with
>>>>>>>> this case
>>>>>>>> in mind.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Dave
>>>>>>>>
>>>>>>>> On Tue, Feb 22, 2011 at 8:22 PM, tijoriwala.ritesh <
>>>>>>>> tijoriwala.ritesh@gmail.com> wrote:
>>>>>>>>
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> > I wanted to get details on how does cassandra do synchronous
>>>>>>>> writes to W
>>>>>>>> > replicas (out of N)? Does it do a 2PC? If not, how does
it deal
>>>>>>>> with
>>>>>>>> > failures of of nodes before it gets to write to W replicas?
If the
>>>>>>>> > orchestrating node cannot write to W nodes successfully,
I guess
>>>>>>>> it will
>>>>>>>> > fail the write operation but what happens to the completed
writes
>>>>>>>> on X (W
>>>>>>>> > >
>>>>>>>> > X) nodes?
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Ritesh
>>>>>>>> > --
>>>>>>>> > View this message in context:
>>>>>>>> >
>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055152.html
>>>>>>>> > Sent from the cassandra-user@incubator.apache.org mailing
list
>>>>>>>> archive at
>>>>>>>> > Nabble.com.
>>>>>>>> >
>>>>>>>>
>>>>>>>> </quote>
>>>>>>>> Quoted from:
>>>>>>>>
>>>>>>>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-tp6055152p6055408.html
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message