Subject: Re: handling deletes
From: Avinash Lakshman
To: cassandra-dev@incubator.apache.org
Date: Wed, 1 Apr 2009 12:37:58 -0700

In reality, from two years of production experience with Dynamo and now with Cassandra, it is not as extreme as it seems :). The options are either strong consistency, which is hard to get right in a distributed setting and which costs you availability even when you do get it right, or eventual consistency. All the tools like read repair help in achieving the latter. So I guess it boils down to what you want from your app: C or A. (Rough sketches of hinted handoff and read repair appear at the end of this message.)

Avinash

On Wed, Apr 1, 2009 at 11:46 AM, Jun Rao wrote:
>
> My reply is inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> junrao@almaden.ibm.com
>
> Jonathan Ellis wrote on 04/01/2009 10:50:37 AM:
>
> > On Wed, Apr 1, 2009 at 11:41 AM, Jun Rao wrote:
> > > I am wondering if this is part of the bigger issue of data consistency.
> > > Following your example: a row x is replicated to nodes A, B, and C. C goes
> > > down. A and B delete x. When C comes back, C should contact other nodes that
> > > hold hinted handoff data intended for C. So, in theory, the missing deletion
> > > of x will be propagated to C at some point and not lost. However, the
> > > problem is that those hinted handoff nodes can die before the handoff
> > > completes. Then C needs some other way to sync itself up. Nodes A and B are
> > > the only possible sources. Unfortunately, data in A and B is accumulated
> > > independently of C, and therefore syncing them up is a bit challenging.
> >
> > Right. Or you could have a network partition when C comes back up,
> > preventing the handoff. There are lots of things that can go wrong.
> > Hence the "eventual" part of "eventually consistent." :)
> >
> > > In the short run, I am not sure if I really like the solution you suggested
> > > here. However, I don't have a better solution either.
> >
> > Like I said, it's not perfect, but it's better than the alternatives
> > I've seen. I'd much rather have an imperfect solution than none at all.
> >
> > > In the long run, maybe we should look into peer-to-peer replication
> > > techniques instead of relying on hinted handoff. In P2P replication, an
> > > update can be directed to any replica, which will try to push it to its
> > > peers. The push will be almost real time if the peers are up. If a peer is
> > > down, changes for it will be accumulated and re-pushed when it is up again.
> > > Because an update is always initiated from one replica, it is easier to sync
> > > up the replicas through log shipping.
> >
> > There's a huge amount of complexity you're glossing over, though: what
> > if the replica responsible for the initiation goes down? Then you
> > have to elect a new one. This is (a) very complicated and (b) causes
> > loss of availability. I prefer the existing system. (If you want
> > consistency over availability, then HBase or Hypertable is a better
> > choice, since that is what they are designed for.)
>
> P2P replication definitely adds complexity, and it is just one of the
> alternatives. However, there is also complexity in hinted handoff + read
> repair + Merkle trees (when they are added). I am not sure which one is more
> complicated. In P2P replication, since you can initiate a write on any
> replica, you just need to pick a live replica for writes. As for
> availability, a lot has to do with how quickly a failed node is detected.
> Today, if you write to a node that has actually failed but has not yet been
> detected by Cassandra, the write will also fail.
>
> Overall, I think eventual consistency is fine. However, eventual
> consistency probably shouldn't be equated with updates taking forever to
> show up. Some sort of guarantee on how outdated a piece of data is will
> likely be useful to many applications.
>
> > -Jonathan
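Below is a minimal sketch of the hinted-handoff path the thread is describing, under the assumption of a coordinator that queues any mutation it could not deliver to a down replica and replays the queue once the failure detector reports that replica alive again. The names here (HintedHandoffSketch, Mutation, ReplicaClient) are invented for illustration and are not Cassandra's actual classes; the point is that a delete travels as an explicit tombstone, so a returning node like C in Jun's example learns that x was removed while it was down.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class HintedHandoffSketch {

    // A mutation is either an insert/update or a delete; deletes are explicit tombstones.
    public static class Mutation {
        final String key;
        final String value;          // null for a tombstone
        final long timestampMicros;  // timestamp used to reconcile versions later
        final boolean isTombstone;

        Mutation(String key, String value, long timestampMicros, boolean isTombstone) {
            this.key = key;
            this.value = value;
            this.timestampMicros = timestampMicros;
            this.isTombstone = isTombstone;
        }
    }

    // Stands in for whatever messaging layer actually ships mutations to a peer.
    public interface ReplicaClient {
        void apply(String replica, Mutation m);
    }

    // One queue of undelivered mutations per replica that was down at write time.
    private final Map<String, Queue<Mutation>> hints = new HashMap<String, Queue<Mutation>>();

    // Coordinator path: the write (or delete) aimed at 'downReplica' could not be
    // delivered, so remember it locally instead of dropping it.
    public void storeHint(String downReplica, Mutation m) {
        Queue<Mutation> q = hints.get(downReplica);
        if (q == null) {
            q = new ArrayDeque<Mutation>();
            hints.put(downReplica, q);
        }
        q.add(m);
    }

    // When the failure detector reports 'downReplica' alive again, replay everything
    // queued for it -- including tombstones for rows deleted while it was down, which
    // is how the missing delete of x reaches C in the example above.
    public void replayHints(String downReplica, ReplicaClient client) {
        Queue<Mutation> pending = hints.remove(downReplica);
        if (pending == null) {
            return;
        }
        for (Mutation m : pending) {
            client.apply(downReplica, m);
        }
    }
}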
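Read repair, the other tool mentioned for converging replicas, can be sketched in the same spirit: the coordinator reads a key from every replica, takes the version with the newest timestamp (possibly a tombstone), and writes that winner back to any replica that returned something older or nothing at all. Again, the types and method names below are made up for the example rather than taken from the codebase.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReadRepairSketch {

    // A replica's answer for a key: a value or a tombstone, plus the timestamp
    // that decides which version wins.
    public static class Versioned {
        final String value;          // null when this is a tombstone
        final long timestampMicros;
        final boolean isTombstone;

        Versioned(String value, long timestampMicros, boolean isTombstone) {
            this.value = value;
            this.timestampMicros = timestampMicros;
            this.isTombstone = isTombstone;
        }
    }

    public interface ReplicaClient {
        Versioned read(String replica, String key);           // null if the replica has nothing
        void write(String replica, String key, Versioned v);  // push a version to a replica
    }

    // Read from every replica, pick the newest version, and push it back to any
    // replica that returned an older version or nothing at all. The winner may be
    // a tombstone, which is how a missed delete eventually reaches a stale replica.
    public String readWithRepair(String key, List<String> replicas, ReplicaClient client) {
        Map<String, Versioned> responses = new HashMap<String, Versioned>();
        Versioned newest = null;
        for (String r : replicas) {
            Versioned v = client.read(r, key);
            responses.put(r, v);
            if (v != null && (newest == null || v.timestampMicros > newest.timestampMicros)) {
                newest = v;
            }
        }
        if (newest == null) {
            return null; // no replica has ever seen this key
        }
        for (String r : replicas) {
            Versioned v = responses.get(r);
            if (v == null || v.timestampMicros < newest.timestampMicros) {
                client.write(r, key, newest); // the repair write
            }
        }
        // Callers never see a tombstone; to them the row is simply gone.
        return newest.isTombstone ? null : newest.value;
    }
}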