cassandra-user mailing list archives

From: AJ
Subject: Re: Strong Consistency with ONE read/writes
Date: Sat, 02 Jul 2011 21:30:23 GMT
Yang, you seem to understand all of the details, at least the ones that 
have occurred to me, such as having a failure protocol rather than a 
perfect failure detector, and coordinating the hand-off to a new leader.

I finally did some more reading outside of the Cassandra space and realized 
HBase has what I was asking about.  If Cass could be flexible enough to 
allow such a setup without violating its goals, that would be great, imho.

This thread is just a brainstorming, exploratory thread (by a non-expert) 
based on the simplistic observation that, if all clients go directly to 
the responsible replica every time (sketched below), you get:

- guaranteed monotonic read/write consistency
- read-your-writes consistency
- higher performance (lower latency)

all with reads and writes at ONE.
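
For illustration, here is roughly what that client-side computation could 
look like.  This is a minimal Python sketch, assuming a RandomPartitioner-style 
MD5 ring and SimpleStrategy-style placement; the ring contents and helper 
names are made up, not actual C* client code:

import hashlib
from bisect import bisect_right

def token_for(key: bytes) -> int:
    # RandomPartitioner-style token: the MD5 hash of the key.
    return int(hashlib.md5(key).hexdigest(), 16)

def replicas_for(key: bytes, ring: dict, rf: int = 3) -> list:
    # ring maps token -> node address.  Replicas are the first `rf`
    # distinct nodes walking clockwise from the key's token.
    tokens = sorted(ring)
    start = bisect_right(tokens, token_for(key)) % len(tokens)
    picked = []
    for i in range(len(tokens)):
        node = ring[tokens[(start + i) % len(tokens)]]
        if node not in picked:
            picked.append(node)
        if len(picked) == rf:
            break
    return picked

# Toy ring with made-up tokens; a real client would learn the ring
# from the cluster.  picked[0] is the responsible "first" replica.
ring = {0: "10.0.0.1", 2**126: "10.0.0.2", 2**127: "10.0.0.3"}
primary = replicas_for(b"user:42", ring)[0]

As long as every client runs the same computation over the same view of the 
ring, they all pick the same first replica for a key, which is what the 
guarantees above hinge on.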

Basically, it's like a master/slave setup, except that the slaves can 
take over as master, so you still have high availability.

I'm not saying it's easy; I'm coming at this purely from a customer-request 
point of view.  The question is: would this be useful if it could be added 
to Cass's bag of tricks?  Cass is already a hybrid.


On 7/2/2011 1:57 PM, Yang wrote:
> Jonathan:
> could you please elaborate more on specifically why they are "not even
> close"?
> --- I kind of see what you mean (please correct me if I misunderstood):
> Cassandra's failure detector is consulted on every write, while HBase's
> failure detector is only used when a tablet server joins or leaves.
> In order to have the single-write-entry-point approach originally
> brought up in this thread, I think you need a strong membership
> protocol to lock key-range leadership; once leadership is acquired,
> failure detectors do not need to be consulted on every write.
> Yes, by definition of the original requirement brought up in this
> thread, Cassandra's write behavior is going to change to be more like
> HBase, and Mongo in "replica set" mode.  But it seems that this leader
> mode could even co-exist with the multi-entry write mode that Cassandra
> uses now, just as you can use a different CL for each single write
> request.  In that case you would need to keep the current lightweight
> Phi-detector and also add ZK leader election for single-entry-mode
> writes (sketched below).
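
For concreteness, here is a minimal sketch of that classic ZK election 
recipe for a single key range, in Python with kazoo; the znode paths and 
range naming are invented for illustration:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

ELECTION = "/cassandra/leaders/range-42"   # hypothetical per-range path
zk.ensure_path(ELECTION)

# Each candidate creates an ephemeral sequential znode; the lowest
# sequence number holds leadership for this key range.
me = zk.create(ELECTION + "/candidate-", ephemeral=True, sequence=True)
my_name = me.rsplit("/", 1)[1]

def is_leader() -> bool:
    return my_name == sorted(zk.get_children(ELECTION))[0]

if not is_leader():
    # Watch only the next-lower candidate so a leader death doesn't
    # stampede every follower; re-check leadership when the watch fires.
    children = sorted(zk.get_children(ELECTION))
    predecessor = children[children.index(my_name) - 1]
    zk.exists(ELECTION + "/" + predecessor,
              watch=lambda event: print("predecessor gone; re-check"))

While is_leader() holds, writes for the range are accepted locally with no 
ZK round-trip per operation; ZK is only consulted on join/leave, which is 
exactly the HBase-style behavior being discussed.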
> Thanks
> Yang
> (I should correct my terminology .... it's not a "strong failure
> detector" that's needed, it's a "strong membership protocol".  Strongly
> complete and accurate failure detectors do not exist in async
> distributed systems; see Chandra and Toueg, "Unreliable Failure
> Detectors for Reliable Distributed Systems", Journal of the ACM,
> 43(2):225-267, 1996, and FLP, "Impossibility of Distributed Consensus
> with One Faulty Process".)
> On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis wrote:
> > The way HBase uses ZK (for master election) is not even close to how
> > Cassandra uses the failure detector.
> > Using ZK for each operation would (a) not scale and (b) not work
> > cross-DC for any reasonable latency requirements.
> > On Sat, Jul 2, 2011 at 11:55 AM, Yang wrote:
> > > There is a JIRA completed in 0.7.x that "prefers" a certain node in
> > > the snitch, so this does roughly what you want MOST of the time.
> > >
> > > But the problem is that it does not GUARANTEE that the same node
> > > will always be read.  I recently read the HBase vs Cassandra
> > > comparison thread that started after Facebook dropped Cassandra for
> > > their messaging system, and understood some of the differences.
> > > What you want is essentially what HBase does.  The fundamental
> > > difference there is really due to the gossip protocol: it's a
> > > probabilistic, or eventually consistent, failure detector, while
> > > HBase/Google Bigtable use ZooKeeper/Chubby to provide a strong
> > > failure detector (a distributed lock).  So in HBase, if a tablet
> > > server goes down, it really goes down; it cannot re-grab the tablet
> > > from the new tablet server without going through a start-up protocol
> > > (notifying the master, which would notify the clients, etc.).  In
> > > other words, it is guaranteed that one tablet is served by only one
> > > tablet server at any given time.  In comparison, the above JIRA only
> > > TRIES to serve that key from one particular replica.  HBase can have
> > > that guarantee because the group membership is maintained by the
> > > strong failure detector.
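
To make the "really goes down" part concrete: the exclusivity falls out of 
ZK sessions and ephemeral nodes.  A sketch (kazoo again; the paths, names, 
and fencing flag are invented):

from kazoo.client import KazooClient, KazooState

zk = KazooClient(hosts="zk1:2181")
fenced = False

def on_state(state):
    global fenced
    # If our session is lost, ZK has already released the ownership
    # znode and another server may claim the range: stop serving now.
    if state == KazooState.LOST:
        fenced = True

zk.add_listener(on_state)
zk.start()

# Ownership is an ephemeral znode: it exists iff this session is
# alive, so at most one server can own the range at any given time.
zk.create("/ranges/range-42/owner", b"server-a",
          ephemeral=True, makepath=True)

def serve(key):
    if fenced:
        raise RuntimeError("lost ownership; rejoin via start-up protocol")
    ...  # handle the read/write locally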
> > > Just for hacking curiosity, a strong failure detector + Cassandra
> > > replicas is not impossible (actually it seems not difficult),
> > > although the performance is not clear.  What would such a strong
> > > failure detector bring to Cassandra besides this ONE-ONE strong
> > > consistency?  That is an interesting question, I think.
> > >
> > > Considering that HBase has been deployed on big clusters, it is
> > > probably OK with the performance of the strong ZooKeeper failure
> > > detector.  Then a further question was: why did Dynamo originally
> > > choose the probabilistic failure detector?  Yes, Dynamo's main theme
> > > is "eventually consistent", so the Phi-detector is **enough**, but
> > > if a strong detector buys us more at little cost, wouldn't that be
> > > great?
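
For reference, the Phi accrual detector Yang mentions is roughly this 
shape.  A sketch of the accrual idea (Hayashibara et al.); the exponential 
approximation and window size here are my assumptions, not Cassandra's 
exact code, though 8 is its default phi_convict_threshold:

import math, time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window: int = 1000, threshold: float = 8.0):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last = None
        self.threshold = threshold

    def heartbeat(self, now: float = None) -> None:
        now = time.time() if now is None else now
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now: float = None) -> float:
        if not self.intervals:
            return 0.0
        now = time.time() if now is None else now
        mean = sum(self.intervals) / len(self.intervals)
        # Assume exponential arrivals: P(next heartbeat later than t)
        # ~= exp(-t/mean), so phi = -log10(exp(-t/mean)) = t/(mean*ln 10).
        return (now - self.last) / (mean * math.log(10))

    def suspect(self, now: float = None) -> bool:
        # Probabilistic verdict: confidence grows the longer a node is
        # silent, but a "suspected" node may in fact still be alive.
        return self.phi(now) > self.threshold

This gives a confidence level that grows with silence, never the hard "it 
is down" fact that a ZK session gives you, which is the crux of the thread.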
> > >
> > > On Fri, Jul 1, 2011 at 6:53 PM, AJ wrote:
> > > >
> > > > Is this possible?
> > > >
> > > > All reads and writes for a given key will always go to the same
> > > > node from a client.  It seems the only thing needed is to allow
> > > > the clients to compute which node is the closest replica for the
> > > > given key, using the same algorithm C* uses.  When the first
> > > > replica receives the write request, it will write to itself, which
> > > > should complete before any of the other replicas, and then return.
> > > > The loads should still stay balanced if using the random
> > > > partitioner.  If the first replica becomes unavailable (however
> > > > that is defined), then the clients can send to the next replica in
> > > > the ring and switch from ONE reads/writes to QUORUM reads/writes
> > > > temporarily, until the first replica becomes available again.
> > > > QUORUM is required since there could be some replicas that were
> > > > not updated after the first replica went down.
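
Sketching the client logic I have in mind (the driver call and exception 
are placeholders, not a real client API; replicas_for is the ring 
computation sketched earlier):

class NodeUnavailable(Exception):
    """Placeholder for whatever the failure protocol signals."""

def read(key, session, ring, rf=3):
    replicas = replicas_for(key, ring, rf)  # ring sketch from earlier
    try:
        # Normal case: ONE against the designated first replica, which
        # has seen every acknowledged write for this key.
        return session.execute(key, host=replicas[0], cl="ONE")
    except NodeUnavailable:
        # First replica is down: fall back to QUORUM on the survivors,
        # since some of them may be missing writes the first replica
        # acknowledged alone before it failed.
        return session.execute(key, host=replicas[1], cl="QUORUM")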
> > > >
> > > > Will this work?  The goal is to have strong consistency with a
> > > > read/write consistency level as low as possible, with a network
> > > > performance boost as a secondary benefit.
> > >
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder of DataStax, the source for professional Cassandra support
