incubator-cassandra-user mailing list archives

From: AJ
Subject: Re: Strong Consistency with ONE read/writes
Date: Sun, 03 Jul 2011 23:47:52 GMT
On 7/3/2011 3:49 PM, Will Oberman wrote:
> Why not send the value itself instead of a placeholder?  Now it takes
> 2x writes on a random node to do a single update (write placeholder,
> write update) and N writes from the client (write value, write
> placeholder to N-1), where N is the replication factor.  Seems like
> extra network and IO instead of less...

Sending the value to each node is 1) unnecessary, and 2) will only
cause a large burst of network traffic.  Think about what happens if
it's a large data value, such as a document.  Just let C* do its
thing.  The extra messages are tiny and don't significantly increase
latency since they are all sent in parallel.

> Of course, I still think this sounds like reimplementing Cassandra 
> internals in a Cassandra client (just guessing, I'm not a cassandra dev)

I don't see how.  Maybe you should take a peek at the source.

> On Jul 3, 2011, at 5:20 PM, AJ wrote:
>> Yang,
>> How would you deal with the problem where the 1st node responds
>> with success but then crashes before forwarding any replicas?
>> Then, after switching to the next primary, a read would return
>> stale data.
>> Here's a quick-n-dirty way:  Send the value to the primary replica
>> and send placeholder values to the other replicas.  The placeholder
>> value is something like "PENDING_UPDATE".  The placeholder values
>> are sent with timestamps 1 less than the timestamp of the actual
>> value that went to the primary.  Later, when the changes propagate,
>> the actual values will overwrite the placeholders.  In the event of
>> a crash before the placeholder gets overwritten, the next read will
>> return the placeholder, which tells the client what happened.  The
>> client will report to the user that the key/column is unavailable.
>> The downside is you've overwritten your data and maybe would like
>> to know what the old data was!  But maybe there's another way using
>> other columns or with MVCC.  The client would want a success from
>> the primary and the secondary replicas to be certain of future read
>> consistency in case the primary goes down immediately, as I said
>> above.  The ability to set an "update_pending" flag on any column
>> value would probably make this work.  But I'll think more on this
>> later.
>> aj
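
A minimal sketch of the placeholder write described above, assuming a
hypothetical client interface that can address individual replicas
directly (stock clients of this era go through a coordinator, so the
per-endpoint insert is this sketch's main assumption):

    import java.util.List;

    // Hypothetical per-endpoint client; DirectClient and its insert()
    // are made up for this sketch, not a real library API.
    interface DirectClient {
        void insert(String endpoint, String key, String column,
                    byte[] value, long timestampMicros);
    }

    class PlaceholderWriter {
        static final byte[] PENDING = "PENDING_UPDATE".getBytes();

        void write(DirectClient client, List<String> replicas,
                   String key, String column, byte[] value) {
            long ts = System.currentTimeMillis() * 1000; // C*-style micros
            // real value goes to the primary (first) replica at ts
            client.insert(replicas.get(0), key, column, value, ts);
            // placeholders go to the rest at ts - 1, so the real value,
            // once forwarded by C*, wins timestamp reconciliation
            for (String replica : replicas.subList(1, replicas.size())) {
                client.insert(replica, key, column, PENDING, ts - 1);
            }
        }
    }

A read that returns PENDING_UPDATE then signals the "unavailable"
state AJ describes.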
>> On 7/2/2011 10:55 AM, Yang wrote:
>>> there is a JIRA completed in 0.7.x that "prefers" a certain node in
>>> the snitch, so this does roughly what you want MOST of the time.
>>> but the problem is that it does not GUARANTEE that the same node
>>> will always be read.  I recently read the HBase vs Cassandra
>>> comparison thread that started after Facebook dropped Cassandra for
>>> their messaging system, and understood some of the differences. what
>>> you want is essentially what HBase does. the fundamental difference
>>> there is really due to the gossip protocol: it's a probabilistic, or
>>> eventually consistent, failure detector, while HBase/Google Bigtable
>>> use Zookeeper/Chubby to provide a strong failure detector (a
>>> distributed lock).  so in HBase, if a tablet server goes down, it
>>> really goes down; it cannot grab the tablet back from the new tablet
>>> server without going through a start-up protocol (notifying the
>>> master, which would notify the clients, etc.).  in other words, it
>>> is guaranteed that one tablet is served by only one tablet server at
>>> any given time.  in comparison, the above JIRA only TRIES to serve
>>> that key from one particular replica. HBase can have that guarantee
>>> because the group membership is maintained by the strong failure
>>> detector.
>>> just for hacking curiosity, a strong failure detector + Cassandra
>>> replicas is not impossible (actually it seems not difficult),
>>> although the performance is not clear. what would such a strong
>>> failure detector bring to Cassandra besides this ONE-ONE strong
>>> consistency?  that is an interesting question, I think.
>>> considering that HBase has been deployed on big clusters, the
>>> performance of the strong Zookeeper failure detector is probably
>>> OK. then a further question is: why did Dynamo originally choose
>>> the probabilistic failure detector?  yes, Dynamo's main theme is
>>> "eventually consistent", so the Phi-detector is **enough**, but if
>>> a strong detector buys us more at little cost, wouldn't that be
>>> great?
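
The strong failure detector Yang mentions is typically built on
ZooKeeper ephemeral nodes, which disappear when the owning session
expires.  A rough sketch of the idea, not from any actual Cassandra
integration; the /members path, session timeout, and failover hook
are illustrative:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    // Each node registers an ephemeral znode tied to its ZooKeeper
    // session.  If the node dies, or is partitioned long enough for
    // the session to expire, the znode vanishes and watchers fire:
    // the node is *known* to be out, unlike a gossip-based suspicion.
    public class StrongFailureDetector implements Watcher {
        private final ZooKeeper zk;

        // assumes a persistent /members parent znode already exists
        public StrongFailureDetector(String zkHosts) throws Exception {
            zk = new ZooKeeper(zkHosts, 5000, this); // 5s session timeout
        }

        // called by a node at startup to announce itself
        public void register(String nodeId) throws Exception {
            zk.create("/members/" + nodeId, new byte[0],
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        // called by a coordinator/client for the authoritative live set
        public List<String> liveMembers() throws Exception {
            return zk.getChildren("/members", true); // true re-arms watch
        }

        public void process(WatchedEvent event) {
            // membership changed: a session expired or a node joined;
            // a primary hand-off would be triggered here
        }
    }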
>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ wrote:
>>>     Is this possible?
>>>     All reads and writes for a given key will always go to the same
>>>     node from a client.  It seems the only thing needed is to allow
>>>     the clients to compute which node is the closest replica for the
>>>     given key using the same algorithm C* uses.  When the first
>>>     replica receives the write request, it will write to itself,
>>>     which should complete before any of the other replicas, and then
>>>     return.  The loads should still stay balanced if using the
>>>     random partitioner.  If the first replica becomes unavailable
>>>     (however that is defined), then the clients can send to the next
>>>     replica in the ring and switch from ONE writes/reads to QUORUM
>>>     writes/reads temporarily until the first replica becomes
>>>     available again.  QUORUM is required since there could be some
>>>     replicas that were not updated after the first replica went
>>>     down.
>>>     Will this work?  The goal is to have strong consistency with a
>>>     read/write consistency level as low as possible, with a network
>>>     performance boost as a secondary goal.
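
A rough sketch of the client-side routing proposed here, assuming
RandomPartitioner (token = abs(MD5(key))) and a client that already
knows the ring's tokens; the class and method names are made up for
illustration:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.Set;
    import java.util.TreeMap;

    public class PrimaryReplicaRouter {
        enum ConsistencyLevel { ONE, QUORUM }

        // token -> node address, sorted; mirrors the cluster's ring view
        private final NavigableMap<BigInteger, String> ring =
                new TreeMap<BigInteger, String>();

        public void addNode(BigInteger token, String address) {
            ring.put(token, address);
        }

        // the same token RandomPartitioner computes: abs(MD5(key))
        static BigInteger md5Token(byte[] key) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5").digest(key);
            return new BigInteger(d).abs();
        }

        // "primary" replica = first node clockwise from the key's token
        public String primaryFor(byte[] key) throws Exception {
            Map.Entry<BigInteger, String> e =
                    ring.ceilingEntry(md5Token(key));
            return (e != null ? e : ring.firstEntry()).getValue();
        }

        // drop to QUORUM while the primary is marked down, ONE otherwise
        public ConsistencyLevel levelFor(byte[] key, Set<String> down)
                throws Exception {
            return down.contains(primaryFor(key))
                    ? ConsistencyLevel.QUORUM : ConsistencyLevel.ONE;
        }
    }

The QUORUM fallback covers the window in which writes accepted by the
old primary at ONE may not yet have reached the surviving replicas.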
