cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From William Oberman <>
Subject Re: Strong Consistency with ONE read/writes
Date Mon, 04 Jul 2011 00:32:09 GMT
Was just going off of: "Send the value to the primary replica and send
placeholder values to the other replicas".  Sounded like you wanted to write
the value to one, and write the placeholder to N-1 to me.  But, C* will
propagate the value to N-1 eventually anyways, 'cause that's just what it
does anyways :-)


On Sun, Jul 3, 2011 at 7:47 PM, AJ <> wrote:

> **
> On 7/3/2011 3:49 PM, Will Oberman wrote:
> Why not send the value itself instead of a placeholder?  Now it takes 2x
> writes on a random node to do a single update (write placeholder, write
> update) and N*x writes from the client (write value, write placeholder to
> N-1). Where N is replication factor.  Seems like extra network and IO
> instead of less...
> To send the value to each node is 1.) unnecessary, 2.) will only cause a
> large burst of network traffic.  Think about if it's a large data value,
> such as a document.  Just let C* do it's thing.  The extra messages are tiny
> and doesn't significantly increase latency since they are all sent
> asynchronously.
> Of course, I still think this sounds like reimplementing Cassandra
> internals in a Cassandra client (just guessing, I'm not a cassandra dev)
> I don't see how.  Maybe you should take a peek at the source.
> On Jul 3, 2011, at 5:20 PM, AJ <> wrote:
>   Yang,
> How would you deal with the problem when the 1st node responds success but
> then crashes before completely forwarding any replicas?  Then, after
> switching to the next primary, a read would return stale data.
> Here's a quick-n-dirty way:  Send the value to the primary replica and send
> placeholder values to the other replicas.  The placeholder value is
> something like, "PENDING_UPDATE".  The placeholder values are sent with
> timestamps 1 less than the timestamp for the actual value that went to the
> primary.  Later, when the changes propagate, the actual values will
> overwrite the placeholders.  In event of a crash before the placeholder gets
> overwritten, the next read value will tell the client so.  The client will
> report to the user that the key/column is unavailable.  The downside is
> you've overwritten your data and maybe would like to know what the old data
> was!  But, maybe there's another way using other columns or with MVCC.  The
> client would want a success from the primary and the secondary replicas to
> be certain of future read consistency in case the primary goes down
> immediately as I said above.  The ability to set an "update_pending" flag on
> any column value would probably make this work.  But, I'll think more on
> this later.
> aj
> On 7/2/2011 10:55 AM, Yang wrote:
> there is a JIRA completed in 0.7.x that "Prefers" a certain node in snitch,
> so this does roughly what you want MOST of the time
>  but the problem is that it does not GUARANTEE that the same node will
> always be read.  I recently read into the HBase vs Cassandra comparison
> thread that started after Facebook dropped Cassandra for their messaging
> system, and understood some of the differences. what you want is essentially
> what HBase does. the fundamental difference there is really due to the
> gossip protocol: it's a probablistic, or eventually consistent failure
> detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a
> strong failure detector (a distributed lock).  so in HBase, if a tablet
> server goes down, it really goes down, it can not re-grab the tablet from
> the new tablet server without going through a start up protocol (notifying
> the master, which would notify the clients etc),  in other words it is
> guaranteed that one tablet is served by only one tablet server at any given
> time.  in comparison the above JIRA only TRYIES to serve that key from one
> particular replica. HBase can have that guarantee because the group
> membership is maintained by the strong failure detector.
>  just for hacking curiosity, a strong failure detector + Cassandra
> replicas is not impossible (actually seems not difficult), although the
> performance is not clear. what would such a strong failure detector bring to
> Cassandra besides this ONE-ONE strong consistency ? that is an interesting
> question I think.
>  considering that HBase has been deployed on big clusters, it is probably
> OK with the performance of the strong  Zookeeper failure detector. then a
> further question was: why did Dynamo originally choose to use the
> probablistic failure detector? yes Dynamo's main theme is "eventually
> consistent", so the Phi-detector is **enough**, but if a strong detector
> buys us more with little cost, wouldn't that  be great?
> On Fri, Jul 1, 2011 at 6:53 PM, AJ <> wrote:
>> Is this possible?
>> All reads and writes for a given key will always go to the same node from
>> a client.  It seems the only thing needed is to allow the clients to compute
>> which node is the closes replica for the given key using the same algorithm
>> C* uses.  When the first replica receives the write request, it will write
>> to itself which should complete before any of the other replicas and then
>> return.  The loads should still stay balanced if using random partitioner.
>>  If the first replica becomes unavailable (however that is defined), then
>> the clients can send to the next repilca in the ring and switch from ONE
>> write/reads to QUORUM write/reads temporarily until the first replica
>> becomes available again.  QUORUM is required since there could be some
>> replicas that were not updated after the first replica went down.
>> Will this work?  The goal is to have strong consistency with a read/write
>> consistency level as low as possible while secondarily a network performance
>> boost.

Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835

View raw message