Why not send the value itself instead of a placeholder? Now it takes 2x writes on a random node to do a single update (write placeholder, write update) and N*x writes from the client (write value, write placeholder to N-1). Where N is replication factor. Seems like extra network and IO instead of less... Of course, I still think this sounds like reimplementing Cassandra internals in a Cassandra client (just guessing, I'm not a cassandra dev) On Jul 3, 2011, at 5:20 PM, AJ wrote: > Yang, > > How would you deal with the problem when the 1st node responds > success but then crashes before completely forwarding any replicas? > Then, after switching to the next primary, a read would return stale > data. > > Here's a quick-n-dirty way: Send the value to the primary replica > and send placeholder values to the other replicas. The placeholder > value is something like, "PENDING_UPDATE". The placeholder values > are sent with timestamps 1 less than the timestamp for the actual > value that went to the primary. Later, when the changes propagate, > the actual values will overwrite the placeholders. In event of a > crash before the placeholder gets overwritten, the next read value > will tell the client so. The client will report to the user that > the key/column is unavailable. The downside is you've overwritten > your data and maybe would like to know what the old data was! But, > maybe there's another way using other columns or with MVCC. The > client would want a success from the primary and the secondary > replicas to be certain of future read consistency in case the > primary goes down immediately as I said above. The ability to set > an "update_pending" flag on any column value would probably make > this work. But, I'll think more on this later. > > aj > > On 7/2/2011 10:55 AM, Yang wrote: >> >> there is a JIRA completed in 0.7.x that "Prefers" a certain node in >> snitch, so this does roughly what you want MOST of the time >> >> >> but the problem is that it does not GUARANTEE that the same node >> will always be read. I recently read into the HBase vs Cassandra >> comparison thread that started after Facebook dropped Cassandra for >> their messaging system, and understood some of the differences. >> what you want is essentially what HBase does. the fundamental >> difference there is really due to the gossip protocol: it's a >> probablistic, or eventually consistent failure detector while >> HBase/Google Bigtable use Zookeeper/Chubby to provide a strong >> failure detector (a distributed lock). so in HBase, if a tablet >> server goes down, it really goes down, it can not re-grab the >> tablet from the new tablet server without going through a start up >> protocol (notifying the master, which would notify the clients >> etc), in other words it is guaranteed that one tablet is served by >> only one tablet server at any given time. in comparison the above >> JIRA only TRYIES to serve that key from one particular replica. >> HBase can have that guarantee because the group membership is >> maintained by the strong failure detector. >> >> just for hacking curiosity, a strong failure detector + Cassandra >> replicas is not impossible (actually seems not difficult), although >> the performance is not clear. what would such a strong failure >> detector bring to Cassandra besides this ONE-ONE strong >> consistency ? that is an interesting question I think. >> >> considering that HBase has been deployed on big clusters, it is >> probably OK with the performance of the strong Zookeeper failure >> detector. then a further question was: why did Dynamo originally >> choose to use the probablistic failure detector? yes Dynamo's main >> theme is "eventually consistent", so the Phi-detector is >> **enough**, but if a strong detector buys us more with little cost, >> wouldn't that be great? >> >> >> >> On Fri, Jul 1, 2011 at 6:53 PM, AJ wrote: >> Is this possible? >> >> All reads and writes for a given key will always go to the same >> node from a client. It seems the only thing needed is to allow the >> clients to compute which node is the closes replica for the given >> key using the same algorithm C* uses. When the first replica >> receives the write request, it will write to itself which should >> complete before any of the other replicas and then return. The >> loads should still stay balanced if using random partitioner. If >> the first replica becomes unavailable (however that is defined), >> then the clients can send to the next repilca in the >> ring and switch from ONE write/reads to QUORUM write/reads >> temporarily until the first replica becomes available again. >> QUORUM is required since there could be some replicas that were not >> updated after the first replica went down. >> >> Will this work? The goal is to have strong consistency with a read/ >> write consistency level as low as possible while secondarily a >> network performance boost. >> >