cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From AJ>
Subject Re: Strong Consistency with ONE read/writes
Date Sun, 03 Jul 2011 21:20:17 GMT

How would you deal with the problem when the 1st node responds success 
but then crashes before completely forwarding any replicas?  Then, after 
switching to the next primary, a read would return stale data.

Here's a quick-n-dirty way:  Send the value to the primary replica and 
send placeholder values to the other replicas.  The placeholder value is 
something like, "PENDING_UPDATE".  The placeholder values are sent with 
timestamps 1 less than the timestamp for the actual value that went to 
the primary.  Later, when the changes propagate, the actual values will 
overwrite the placeholders.  In event of a crash before the placeholder 
gets overwritten, the next read value will tell the client so.  The 
client will report to the user that the key/column is unavailable.  The 
downside is you've overwritten your data and maybe would like to know 
what the old data was!  But, maybe there's another way using other 
columns or with MVCC.  The client would want a success from the primary 
and the secondary replicas to be certain of future read consistency in 
case the primary goes down immediately as I said above.  The ability to 
set an "update_pending" flag on any column value would probably make 
this work.  But, I'll think more on this later.


On 7/2/2011 10:55 AM, Yang wrote:
> there is a JIRA completed in 0.7.x that "Prefers" a certain node in 
> snitch, so this does roughly what you want MOST of the time
> but the problem is that it does not GUARANTEE that the same node will 
> always be read.  I recently read into the HBase vs Cassandra 
> comparison thread that started after Facebook dropped Cassandra for 
> their messaging system, and understood some of the differences. what 
> you want is essentially what HBase does. the fundamental difference 
> there is really due to the gossip protocol: it's a probablistic, or 
> eventually consistent failure detector  while HBase/Google Bigtable 
> use Zookeeper/Chubby to provide a strong failure detector (a 
> distributed lock).  so in HBase, if a tablet server goes down, it 
> really goes down, it can not re-grab the tablet from the new tablet 
> server without going through a start up protocol (notifying the 
> master, which would notify the clients etc),  in other words it is 
> guaranteed that one tablet is served by only one tablet server at any 
> given time.  in comparison the above JIRA only TRYIES to serve that 
> key from one particular replica. HBase can have that guarantee because 
> the group membership is maintained by the strong failure detector.
> just for hacking curiosity, a strong failure detector + Cassandra 
> replicas is not impossible (actually seems not difficult), although 
> the performance is not clear. what would such a strong failure 
> detector bring to Cassandra besides this ONE-ONE strong consistency ? 
> that is an interesting question I think.
> considering that HBase has been deployed on big clusters, it is 
> probably OK with the performance of the strong  Zookeeper failure 
> detector. then a further question was: why did Dynamo originally 
> choose to use the probablistic failure detector? yes Dynamo's main 
> theme is "eventually consistent", so the Phi-detector is **enough**, 
> but if a strong detector buys us more with little cost, wouldn't that 
>  be great?
> On Fri, Jul 1, 2011 at 6:53 PM, AJ < 
> <>> wrote:
>     Is this possible?
>     All reads and writes for a given key will always go to the same
>     node from a client.  It seems the only thing needed is to allow
>     the clients to compute which node is the closes replica for the
>     given key using the same algorithm C* uses.  When the first
>     replica receives the write request, it will write to itself which
>     should complete before any of the other replicas and then return.
>      The loads should still stay balanced if using random partitioner.
>      If the first replica becomes unavailable (however that is
>     defined), then the clients can send to the next repilca in the
>     ring and switch from ONE write/reads to QUORUM write/reads
>     temporarily until the first replica becomes available again.
>      QUORUM is required since there could be some replicas that were
>     not updated after the first replica went down.
>     Will this work?  The goal is to have strong consistency with a
>     read/write consistency level as low as possible while secondarily
>     a network performance boost.

View raw message