incubator-cassandra-user mailing list archives

From Will Oberman <>
Subject Re: Strong Consistency with ONE read/writes
Date Sun, 03 Jul 2011 21:49:22 GMT
Why not send the value itself instead of a placeholder?  Now it takes  
two writes on each non-primary replica to do a single update (write the  
placeholder, then write the update) and N writes from the client (the  
value to the primary, placeholders to the other N-1), where N is the  
replication factor.  That seems like extra network and I/O instead of  
less...
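For concreteness (assuming RF = 3, and that I'm reading the proposal  
right): a normal ONE write is one client request plus the coordinator's  
writes to the 3 replicas; the placeholder scheme is 3 client requests  
(the value plus 2 placeholders) and then 2 more writes as the real  
value propagates and overwrites the placeholders.  More messages and  
more disk writes, not fewer.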

Of course, I still think this sounds like reimplementing Cassandra  
internals in a Cassandra client (just guessing; I'm not a Cassandra dev).

On Jul 3, 2011, at 5:20 PM, AJ <> wrote:

> Yang,
> How would you deal with the problem where the 1st node responds  
> success but then crashes before forwarding the write to any replicas?  
> Then, after switching to the next primary, a read would return stale  
> data.
> Here's a quick-n-dirty way:  Send the value to the primary replica  
> and send placeholder values to the other replicas.  The placeholder  
> value is something like "PENDING_UPDATE", sent with a timestamp 1  
> less than the timestamp of the actual value that went to the  
> primary.  Later, when the changes propagate, the actual values will  
> overwrite the placeholders.  If there is a crash before a placeholder  
> gets overwritten, the next read will tell the client so, and the  
> client can report to the user that the key/column is unavailable.  
> The downside is that you've overwritten your data and might like to  
> know what the old data was!  But maybe there's another way, using  
> other columns or MVCC.  The client would want a success from the  
> primary and the secondary replicas to be certain of future read  
> consistency in case the primary goes down immediately, as I said  
> above.  The ability to set an "update_pending" flag on any column  
> value would probably make this work.  But I'll think more on this  
> later.
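> A rough sketch of the timestamp arithmetic (the write() call here is  
> a hypothetical single-node client call, for illustration only, not a  
> real client API):
>
>     import java.util.List;
>
>     interface KvClient {   // hypothetical, for illustration only
>         void write(String key, String col, String val, long ts,
>                    String node);
>     }
>
>     class PlaceholderUpdate {
>         static final String PENDING = "PENDING_UPDATE";
>
>         // The real value goes to the primary at timestamp t; the
>         // placeholders go to the other replicas at t - 1, so the
>         // propagated real value (same t) wins under last-write-wins.
>         static void update(KvClient c, String key, String col,
>                            String val, String primary,
>                            List<String> others) {
>             long t = System.currentTimeMillis() * 1000; // microseconds
>             c.write(key, col, val, t, primary);
>             for (String n : others) {
>                 c.write(key, col, PENDING, t - 1, n);
>             }
>         }
>     }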
> aj
> On 7/2/2011 10:55 AM, Yang wrote:
>> There is a JIRA completed in 0.7.x that "prefers" a certain node in  
>> the snitch, so this does roughly what you want MOST of the time.  
>> But the problem is that it does not GUARANTEE that the same node  
>> will always be read.
>>
>> I recently read through the HBase vs. Cassandra comparison thread  
>> that started after Facebook dropped Cassandra for their messaging  
>> system, and understood some of the differences.  What you want is  
>> essentially what HBase does.  The fundamental difference there is  
>> really due to the gossip protocol: it's a probabilistic, or  
>> eventually consistent, failure detector, while HBase/Google Bigtable  
>> use ZooKeeper/Chubby to provide a strong failure detector (a  
>> distributed lock).  So in HBase, if a tablet server goes down, it  
>> really goes down; it cannot grab the tablet back from the new tablet  
>> server without going through a startup protocol (notifying the  
>> master, which notifies the clients, etc.).  In other words, it is  
>> guaranteed that one tablet is served by only one tablet server at  
>> any given time.  In comparison, the above JIRA only TRIES to serve  
>> that key from one particular replica.  HBase can make that guarantee  
>> because the group membership is maintained by the strong failure  
>> detector.
>> Just for hacking curiosity, a strong failure detector + Cassandra  
>> replicas is not impossible (actually it seems not difficult),  
>> although the performance is unclear.  What would such a strong  
>> failure detector bring to Cassandra besides this ONE-ONE strong  
>> consistency?  That is an interesting question, I think.
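>> A minimal sketch of the ephemeral-znode trick (real ZooKeeper API,  
>> but just a sketch under those assumptions, not production leader  
>> election):
>>
>>     import org.apache.zookeeper.*;
>>
>>     // Ownership of a key range is an ephemeral znode: if the
>>     // owner's session dies, ZooKeeper deletes the node, so at most
>>     // one owner exists at a time -- the "strong" failure detector
>>     // HBase leans on.
>>     class RangeLease {
>>         static boolean tryAcquire(ZooKeeper zk, String rangePath,
>>                                   byte[] ownerId)
>>                 throws KeeperException, InterruptedException {
>>             try {
>>                 zk.create(rangePath, ownerId,
>>                           ZooDefs.Ids.OPEN_ACL_UNSAFE,
>>                           CreateMode.EPHEMERAL);
>>                 return true;   // ours until our session ends
>>             } catch (KeeperException.NodeExistsException e) {
>>                 return false;  // someone else holds the range
>>             }
>>         }
>>     }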
>> Considering that HBase has been deployed on big clusters, the  
>> performance of the strong ZooKeeper failure detector is probably  
>> OK.  Then a further question: why did Dynamo originally choose the  
>> probabilistic failure detector?  Yes, Dynamo's main theme is  
>> "eventually consistent", so the Phi detector is **enough**, but if  
>> a strong detector buys us more at little cost, wouldn't that be  
>> great?
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <> wrote:
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same  
>> node from a client.  It seems the only thing needed is to allow the  
>> clients to compute which node is the closest replica for the given  
>> key, using the same algorithm C* uses.  When the first replica  
>> receives the write request, it will write to itself, which should  
>> complete before any of the other replicas, and then return.  The  
>> load should still stay balanced if using the random partitioner.  
>> If the first replica becomes unavailable (however that is defined),  
>> then the clients can send to the next replica in the ring and  
>> switch from ONE reads/writes to QUORUM reads/writes temporarily,  
>> until the first replica becomes available again.  QUORUM is  
>> required since there could be some replicas that were not updated  
>> after the first replica went down.
>>
>> Will this work?  The goal is to have strong consistency with a  
>> read/write consistency level as low as possible, with a network  
>> performance boost as a secondary aim.
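>> A sketch of the client-side computation (assuming RandomPartitioner,  
>> whose token is essentially the MD5 of the key as a non-negative  
>> 128-bit integer, and assuming the ring map is fetched out of band,  
>> e.g. via describe_ring):
>>
>>     import java.math.BigInteger;
>>     import java.security.MessageDigest;
>>     import java.util.SortedMap;
>>
>>     class ReplicaLocator {
>>         // Hash the key roughly the way RandomPartitioner does.
>>         static BigInteger token(byte[] key) throws Exception {
>>             byte[] d = MessageDigest.getInstance("MD5").digest(key);
>>             // non-negative; Cassandra itself uses abs() on the
>>             // signed value, a sketch-level difference
>>             return new BigInteger(1, d);
>>         }
>>
>>         // Walk the ring clockwise to the first node whose token is
>>         // >= the key's token, wrapping around at the end.
>>         static String primaryFor(byte[] key,
>>                                  SortedMap<BigInteger, String> ring)
>>                 throws Exception {
>>             SortedMap<BigInteger, String> tail =
>>                     ring.tailMap(token(key));
>>             BigInteger t = tail.isEmpty() ? ring.firstKey()
>>                                           : tail.firstKey();
>>             return ring.get(t);
>>         }
>>     }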
