incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From AJ>
Subject Re: Strong Consistency with ONE read/writes
Date Mon, 04 Jul 2011 00:45:19 GMT
On 7/3/2011 6:32 PM, William Oberman wrote:
> Was just going off of: " Send the value to the primary replica and 
> send placeholder values to the other replicas".  Sounded like you 
> wanted to write the value to one, and write the placeholder to N-1 to me.

Yes, that is what I was suggesting.  The point of the placeholders is to 
handle the crash case that I talked about... "like" a WAL does.

> But, C* will propagate the value to N-1 eventually anyways, 'cause 
> that's just what it does anyways :-)
> will
> On Sun, Jul 3, 2011 at 7:47 PM, AJ < 
> <>> wrote:
>     On 7/3/2011 3:49 PM, Will Oberman wrote:
>>     Why not send the value itself instead of a placeholder?  Now it
>>     takes 2x writes on a random node to do a single update (write
>>     placeholder, write update) and N*x writes from the client (write
>>     value, write placeholder to N-1). Where N is replication factor.
>>      Seems like extra network and IO instead of less...
>     To send the value to each node is 1.) unnecessary, 2.) will only
>     cause a large burst of network traffic.  Think about if it's a
>     large data value, such as a document.  Just let C* do it's thing. 
>     The extra messages are tiny and doesn't significantly increase
>     latency since they are all sent asynchronously.
>>     Of course, I still think this sounds like reimplementing
>>     Cassandra internals in a Cassandra client (just guessing, I'm not
>>     a cassandra dev)
>     I don't see how.  Maybe you should take a peek at the source.
>>     On Jul 3, 2011, at 5:20 PM, AJ <
>>     <>> wrote:
>>>     Yang,
>>>     How would you deal with the problem when the 1st node responds
>>>     success but then crashes before completely forwarding any
>>>     replicas?  Then, after switching to the next primary, a read
>>>     would return stale data.
>>>     Here's a quick-n-dirty way:  Send the value to the primary
>>>     replica and send placeholder values to the other replicas.  The
>>>     placeholder value is something like, "PENDING_UPDATE".  The
>>>     placeholder values are sent with timestamps 1 less than the
>>>     timestamp for the actual value that went to the primary.  Later,
>>>     when the changes propagate, the actual values will overwrite the
>>>     placeholders.  In event of a crash before the placeholder gets
>>>     overwritten, the next read value will tell the client so.  The
>>>     client will report to the user that the key/column is
>>>     unavailable.  The downside is you've overwritten your data and
>>>     maybe would like to know what the old data was!  But, maybe
>>>     there's another way using other columns or with MVCC.  The
>>>     client would want a success from the primary and the secondary
>>>     replicas to be certain of future read consistency in case the
>>>     primary goes down immediately as I said above.  The ability to
>>>     set an "update_pending" flag on any column value would probably
>>>     make this work.  But, I'll think more on this later.
>>>     aj
>>>     On 7/2/2011 10:55 AM, Yang wrote:
>>>>     there is a JIRA completed in 0.7.x that "Prefers" a certain
>>>>     node in snitch, so this does roughly what you want MOST of the
>>>>     time
>>>>     but the problem is that it does not GUARANTEE that the same
>>>>     node will always be read.  I recently read into the HBase vs
>>>>     Cassandra comparison thread that started after Facebook dropped
>>>>     Cassandra for their messaging system, and understood some of
>>>>     the differences. what you want is essentially what HBase does.
>>>>     the fundamental difference there is really due to the gossip
>>>>     protocol: it's a probablistic, or eventually consistent failure
>>>>     detector  while HBase/Google Bigtable use Zookeeper/Chubby to
>>>>     provide a strong failure detector (a distributed lock).  so in
>>>>     HBase, if a tablet server goes down, it really goes down, it
>>>>     can not re-grab the tablet from the new tablet server without
>>>>     going through a start up protocol (notifying the master, which
>>>>     would notify the clients etc),  in other words it is guaranteed
>>>>     that one tablet is served by only one tablet server at any
>>>>     given time.  in comparison the above JIRA only TRYIES to serve
>>>>     that key from one particular replica. HBase can have that
>>>>     guarantee because the group membership is maintained by the
>>>>     strong failure detector.
>>>>     just for hacking curiosity, a strong failure detector +
>>>>     Cassandra replicas is not impossible (actually seems not
>>>>     difficult), although the performance is not clear. what would
>>>>     such a strong failure detector bring to Cassandra besides this
>>>>     ONE-ONE strong consistency ? that is an interesting question I
>>>>     think.
>>>>     considering that HBase has been deployed on big clusters, it is
>>>>     probably OK with the performance of the strong  Zookeeper
>>>>     failure detector. then a further question was: why did Dynamo
>>>>     originally choose to use the probablistic failure detector? yes
>>>>     Dynamo's main theme is "eventually consistent", so the
>>>>     Phi-detector is **enough**, but if a strong detector buys us
>>>>     more with little cost, wouldn't that  be great?
>>>>     On Fri, Jul 1, 2011 at 6:53 PM, AJ <
>>>>     <>> wrote:
>>>>         Is this possible?
>>>>         All reads and writes for a given key will always go to the
>>>>         same node from a client.  It seems the only thing needed is
>>>>         to allow the clients to compute which node is the closes
>>>>         replica for the given key using the same algorithm C* uses.
>>>>          When the first replica receives the write request, it will
>>>>         write to itself which should complete before any of the
>>>>         other replicas and then return.  The loads should still
>>>>         stay balanced if using random partitioner.  If the first
>>>>         replica becomes unavailable (however that is defined), then
>>>>         the clients can send to the next repilca in the ring and
>>>>         switch from ONE write/reads to QUORUM write/reads
>>>>         temporarily until the first replica becomes available
>>>>         again.  QUORUM is required since there could be some
>>>>         replicas that were not updated after the first replica went
>>>>         down.
>>>>         Will this work?  The goal is to have strong consistency
>>>>         with a read/write consistency level as low as possible
>>>>         while secondarily a network performance boost.
> -- 
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) <>

View raw message