From: Dan Hendry <dan.hendry.junk@gmail.com>
To: user@cassandra.apache.org
Date: Thu, 16 Jun 2011 21:56:06 -0400
Subject: Re: Propose new ConsistencyLevel.ALL_AVAIL for reads

How would your solution deal with complete network partitions? A node being 'down' does not actually mean it is dead, just that it is unreachable from whatever is making the decision to mark it 'down'.

Following from Ryan's example, consider nodes A, B, and C within a fully partitioned network: all of the nodes are up, but each thinks all the others are down. Your ALL_AVAILABLE consistency level would boil down to consistency level ONE for clients connecting to any of the nodes. If I connect to A, it thinks it is the last one standing and translates ALL_AVAILABLE into ONE. Based on your logic, two clients connecting to two different nodes could each modify a value and then read it, thinking it is 100% consistent, yet it is actually *completely* inconsistent with the value on the other node(s).

I suggest you review the principles of the infamous CAP theorem.
The consistency levels as they stand now allow for an explicit trade-off between 'available and partition tolerant' (ONE read/write) OR 'consistent and available' (QUORUM read/write). Your solution achieves only availability and can guarantee neither consistency nor partition tolerance.

On Thu, Jun 16, 2011 at 7:50 PM, Ryan King <ryan@twitter.com> wrote:
> On Thu, Jun 16, 2011 at 2:12 PM, AJ <aj@dude.podzone.net> wrote:
> > On 6/16/2011 2:37 PM, Ryan King wrote:
> >> On Thu, Jun 16, 2011 at 1:05 PM, AJ <aj@dude.podzone.net> wrote:
> >> <snip>
> >>>> The Cassandra consistency model is pretty elegant and this type of
> >>>> approach breaks that elegance in many ways. It would also only
> >>>> really be useful when the value has a high probability of being
> >>>> updated between a node going down and the value being read.
> >>>
> >>> I'm not sure what you mean. A node can be down for days, during
> >>> which time the value can be updated. The intention is to use the
> >>> nodes available even if they fall below the RF. If there is only 1
> >>> node available for accepting a replica, that should be enough given
> >>> the conditions I stated and updated below.
> >>
> >> If this is your constraint, then you should just use CL.ONE.
> >>
> > My constraint is a CL = "All Available". So, CL.ONE will not work.
>
> That's a solution, not a requirement. What's your requirement?
>
> >>>> Perhaps the simpler approach, which is fairly trivial and does not
> >>>> require any Cassandra change, is to simply downgrade your read from
> >>>> ALL to QUORUM when you get an unavailable exception for this
> >>>> particular read.
> >>>
> >>> It's not so trivial, especially since you would have to build that
> >>> into your client at many levels. I think it would be more
> >>> appropriate (if this idea survives) to put it into Cass.
> >>>>
> >>>> I think the general answer for 'maximum consistency' is QUORUM
> >>>> reads/writes. Based on the fact you are using CL=ALL for reads, I
> >>>> assume you are using CL=ONE for writes: this itself strikes me as a
> >>>> bad idea if you require 'maximum consistency for one critical
> >>>> operation'.
> >>>
> >>> Very true. Specifying QUORUM for BOTH reads/writes provides the 100%
> >>> consistency because of the overlapping of the availability numbers.
> >>> But only if the # of available nodes is not < RF.
> >>
> >> No, it will work as long as the # of available nodes is >= RF/2 + 1.
> >
> > Yes, that's what I meant. Sorry for any confusion. Restated: But only
> > if the # of available nodes is not < RF/2 + 1.
> >>>
> >>> Upon further reflection, this idea can be used for any consistency
> >>> level. The general thrust of my argument is: if a particular value
> >>> can be overwritten by one process regardless of its prior value,
> >>> then that implies that the value on the down node is no longer
> >>> up-to-date and can be disregarded. Just work with the nodes that are
> >>> available.
> >>>
> >>> Actually, now that I think about it...
> >>>
> >>> ALL_AVAIL guarantees 100% consistency iff the latest timestamp of
> >>> the value > latest unavailability time of all unavailable replica
> >>> nodes for that value's row key. Unavailable is defined as a node's
> >>> Cass process that is not reachable from ANY node in the cluster in
> >>> the same data center. If the node in question is available to at
> >>> least one node, then the read should fail, as there is a possibility
> >>> that the value could have been updated some other way.
> >>
> >> Node A can't reliably and consistently know whether node B and node C
> >> can communicate.
> >
> > Well, theoretically, of course; that's the nature of distributed
> > systems. But Cass does indeed make that determination when it counts
> > the number of available replica nodes before it decides if enough
> > replica nodes are available. But this is obvious to you, I'm sure, so
> > maybe I don't understand your statement.
>
> Consider this scenario: given nodes A, B, and C, A thinks C is down but
> B thinks C is up. What do you do? Remember, A doesn't know that B thinks
> C is up; it only knows its own state.
>
> -ryan
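The split-brain failure mode described above can be sketched in a few lines. This is a toy model, not Cassandra code, and the node and method names are made up for illustration: under a full partition, an "all available replicas" read collapses to reading just the local node, so both sides look consistent to themselves while disagreeing completely.

```python
# Toy model of the full-partition scenario from the email above.
# Hypothetical names throughout; this does not model real Cassandra internals.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def read_all_available(self, key, reachable):
        # "ALL_AVAIL": consult every replica this node can currently reach.
        # Under a full partition, reachable is empty, so this is CL.ONE.
        replicas = [self] + reachable
        versions = [n.data[key] for n in replicas if key in n.data]
        return max(versions)[1] if versions else None  # newest timestamp wins

a, b = Node("A"), Node("B")

# Fully partitioned: each node can reach no one, so each accepts its own write.
a.write("x", "from-A", ts=1)
b.write("x", "from-B", ts=2)

# Each side's "ALL_AVAILABLE" read looks consistent to its own clients...
print(a.read_all_available("x", reachable=[]))  # from-A
print(b.read_all_available("x", reachable=[]))  # from-B
# ...yet the two sides hold completely different values for the same key.
```

Once the partition heals (A can reach B again), the same read returns B's newer write, which is exactly the inconsistency a client relying on "ALL_AVAIL = 100% consistent" would never expect.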

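The quorum arithmetic the thread converges on (QUORUM reads and writes stay consistent as long as at least RF/2 + 1 replicas are available) can be checked directly. A minimal sketch, with illustrative function names and replication factors; the underlying rule is that a read set and a write set of replicas must intersect, which holds whenever R + W > RF:

```python
# Quorum overlap: any R replicas and any W replicas chosen from RF replicas
# must share at least one node; this is guaranteed exactly when R + W > RF.

def quorum(rf):
    # Majority of RF replicas, e.g. 2 of 3, 3 of 5.
    return rf // 2 + 1

def overlap_guaranteed(r, w, rf):
    return r + w > rf

# QUORUM reads + QUORUM writes overlap for every replication factor.
for rf in (1, 2, 3, 5, 7):
    q = quorum(rf)
    assert overlap_guaranteed(q, q, rf)

# The setup criticized in the thread: CL.ONE writes with CL.ALL reads only
# works because ALL touches every replica (W=1, R=RF, and 1 + RF > RF)...
assert overlap_guaranteed(1, 3, rf=3)
# ...but the moment one replica is down and the read covers only the
# remaining two, the guarantee is gone: 1 + 2 is not > 3.
assert not overlap_guaranteed(1, 2, rf=3)
```

This is also why the thread's corrected bound is RF/2 + 1 rather than RF: a QUORUM operation only needs a majority alive, whereas ALL needs every replica.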