How would your solution deal with complete network partitions? A node being
'down' does not actually mean it is dead, just that it is unreachable from
whatever is making the decision to mark it 'down'.
Following from Ryan's example, consider nodes A, B, and C but within a
fully partitioned network: all of the nodes are up but each thinks all the
others are down. Your ALL_AVAILABLE consistency level would boil down to
consistency level ONE for clients connecting to any of the nodes. If I
connect to A, it thinks it is the last one standing and translates
'ALL_AVAILABLE' into 'ONE'. Based on your logic, two clients connecting to
two different nodes could each modify a value and then read it back,
believing it is 100% consistent, yet it is actually *completely*
inconsistent with the value on the other node(s).
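The split-brain above is easy to sketch. Here is a minimal, hypothetical
simulation (plain Python, not Cassandra code; the Node class and the
ALL_AVAILABLE-to-ONE degradation are assumptions for illustration only):

```python
# Hypothetical sketch: three fully partitioned nodes, each believing it is
# the last replica standing. Under the proposed ALL_AVAILABLE level, each
# node degrades the request to ONE and serves reads/writes from its local
# copy alone.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}          # this node's local replica
        self.reachable = set()  # peers this node can see (empty: full partition)

    def write(self, key, value):
        # ALL_AVAILABLE with no reachable peers degenerates to ONE:
        # only the local replica is updated.
        self.data[key] = value

    def read(self, key):
        # Likewise, the read is served from the local replica alone.
        return self.data.get(key)

a, b, c = Node("A"), Node("B"), Node("C")

# Two clients, connected to different sides of the partition,
# each update and then read back the "same" value.
a.write("balance", 100)
c.write("balance", 999)

# Each client sees its own write and believes it is fully consistent...
assert a.read("balance") == 100
assert c.read("balance") == 999
# ...yet the replicas disagree: a split-brain inconsistency.
assert a.read("balance") != c.read("balance")
```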
I suggest you review the principles of the well-known CAP theorem. The
consistency levels, as they stand now, allow an explicit trade-off between
'available and partition tolerant' (ONE reads/writes) and 'consistent and
partition tolerant' (QUORUM reads/writes). Your solution achieves only
availability: in the presence of a partition it can guarantee no
consistency at all.
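The reason QUORUM reads plus QUORUM writes stay consistent under partial
failure is simple arithmetic: any read quorum and any write quorum must
share at least one replica. A quick sanity check (assuming the usual
quorum size of floor(RF/2) + 1):

```python
def quorum(rf):
    """Quorum size for a given replication factor."""
    return rf // 2 + 1

for rf in (3, 5, 7):
    q = quorum(rf)
    # A read quorum plus a write quorum together exceed RF, so they must
    # overlap in at least one replica that saw the latest write.
    assert q + q > rf
    # Each quorum operation tolerates rf - q unreachable replicas.
    print(f"RF={rf}: quorum={q}, tolerates {rf - q} down")
```

Once more replicas than RF - quorum are unreachable, quorum operations
fail rather than silently return stale data, which is exactly the
consistency-over-availability choice the proposal gives up.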
On Thu, Jun 16, 2011 at 7:50 PM, Ryan King <ryan@twitter.com> wrote:
> On Thu, Jun 16, 2011 at 2:12 PM, AJ <aj@dude.podzone.net> wrote:
> > On 6/16/2011 2:37 PM, Ryan King wrote:
> >>
> >> On Thu, Jun 16, 2011 at 1:05 PM, AJ <aj@dude.podzone.net> wrote:
> >
> >> <snip>
> >>>>
> >>>> The Cassandra consistency model is pretty elegant and this type of
> >>>> approach breaks that elegance in many ways. It would also only
> >>>> really be useful when the value has a high probability of being
> >>>> updated between a node going down and the value being read.
> >>>
> >>> I'm not sure what you mean. A node can be down for days, during which
> >>> time the value can be updated. The intention is to use the nodes
> >>> available even if they fall below the RF. If there is only 1 node
> >>> available for accepting a replica, that should be enough given the
> >>> conditions I stated and updated below.
> >>
> >> If this is your constraint, then you should just use CL.ONE.
> >>
> > My constraint is a CL = "All Available". So, CL.ONE will not work.
>
> That's a solution, not a requirement. What's your requirement?
>
> >>>>
> >>>> Perhaps the simpler approach, which is fairly trivial and does not
> >>>> require any Cassandra change, is to simply downgrade your read from
> >>>> ALL to QUORUM when you get an unavailable exception for this
> >>>> particular read.
> >>>
> >>> It's not so trivial, especially since you would have to build that
> >>> into your client at many levels. I think it would be more appropriate
> >>> (if this idea survives) to put it into Cassandra.
> >>>>
> >>>> I think the general answer for 'maximum consistency' is QUORUM
> >>>> reads/writes. Based on the fact you are using CL=ALL for reads, I
> >>>> assume you are using CL=ONE for writes: this itself strikes me as a
> >>>> bad idea if you require 'maximum consistency for one critical
> >>>> operation'.
> >>>>
> >>> Very true. Specifying quorum for BOTH reads/writes provides the 100%
> >>> consistency because of the overlapping of the availability numbers.
> >>> But, only if the # of available nodes is not < RF.
> >>
> >> No, it will work as long as the number of available nodes is >= RF/2 + 1
> >
> > Yes, that's what I meant. Sorry for any confusion. Restated: But, only
> > if the # of available nodes is not < RF/2 + 1.
> >>>
> >>> Upon further reflection, this idea can be used for any consistency
> >>> level. The general thrust of my argument is: if a particular value
> >>> can be overwritten by one process regardless of its prior value, then
> >>> that implies that the value in the down node is no longer up to date
> >>> and can be disregarded. Just work with the nodes that are available.
> >>>
> >>> Actually, now that I think about it...
> >>>
> >>> ALL_AVAIL guarantees 100% consistency iff the latest timestamp of
> >>> the value is greater than the latest unavailability time of all
> >>> unavailable replica nodes for that value's row key. Unavailable is
> >>> defined as a node's Cass process that is not reachable from ANY node
> >>> in the cluster in the same data center. If the node in question is
> >>> available to at least one node, then the read should fail, as there
> >>> is a possibility that the value could have been updated some other
> >>> way.
> >>
> >> Node A can't reliably and consistently know whether node B and node C
> >> can communicate.
> >
> > Well, theoretically, of course; that's the nature of distributed
> > systems. But, Cass does indeed make that determination when it counts
> > the number of available replica nodes before it decides if enough
> > replica nodes are available. But, this is obvious to you I'm sure, so
> > maybe I don't understand your statement.
>
> Consider this scenario: given nodes A, B, and C, where A thinks C is
> down but B thinks C is up. What do you do? Remember, A doesn't know that
> B thinks C is up; it only knows its own state.
>
> ryan
>
