From user-return-18396-apmail-cassandra-user-archive=cassandra.apache.org@cassandra.apache.org Mon Jul 4 00:45:52 2011 Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 685644EEF for ; Mon, 4 Jul 2011 00:45:52 +0000 (UTC) Received: (qmail 63102 invoked by uid 500); 4 Jul 2011 00:45:50 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 62935 invoked by uid 500); 4 Jul 2011 00:45:49 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 62927 invoked by uid 99); 4 Jul 2011 00:45:49 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 Jul 2011 00:45:49 +0000 X-ASF-Spam-Status: No, hits=2.9 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [204.13.248.66] (HELO mho-01-ewr.mailhop.org) (204.13.248.66) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 04 Jul 2011 00:45:43 +0000 Received: from 67-6-222-36.hlrn.qwest.net ([67.6.222.36] helo=[192.168.0.2]) by mho-01-ewr.mailhop.org with esmtpsa (TLSv1:CAMELLIA256-SHA:256) (Exim 4.72) (envelope-from ) id 1QdXHu-0004jv-6n for user@cassandra.apache.org; Mon, 04 Jul 2011 00:45:22 +0000 X-Mail-Handler: MailHop Outbound by DynDNS X-Originating-IP: 67.6.222.36 X-Report-Abuse-To: abuse@dyndns.com (see http://www.dyndns.com/services/mailhop/outbound_abuse.html for abuse reporting information) X-MHO-User: U2FsdGVkX1+SvVVR2Zzqmm8PXthv91e11q/nMDPGDHs= Message-ID: <4E110D1F.9070806@dude.podzone.net> Date: Sun, 03 Jul 2011 18:45:19 -0600 From: AJ User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.18) Gecko/20110616 Lightning/1.0b2 Thunderbird/3.1.11 MIME-Version: 1.0 To: user@cassandra.apache.org Subject: Re: Strong Consistency with ONE read/writes References: <4E0E7A07.2050807@dude.podzone.net> <4E10DD11.8010009@dude.podzone.net> <2D176E34-49FA-43A5-979A-2FD75F02BE15@civicscience.com> <4E10FFA8.6090103@dude.podzone.net> In-Reply-To: Content-Type: multipart/alternative; boundary="------------000801090003080609040805" This is a multi-part message in MIME format. --------------000801090003080609040805 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit On 7/3/2011 6:32 PM, William Oberman wrote: > Was just going off of: " Send the value to the primary replica and > send placeholder values to the other replicas". Sounded like you > wanted to write the value to one, and write the placeholder to N-1 to me. Yes, that is what I was suggesting. The point of the placeholders is to handle the crash case that I talked about... "like" a WAL does. > But, C* will propagate the value to N-1 eventually anyways, 'cause > that's just what it does anyways :-) > > will > > On Sun, Jul 3, 2011 at 7:47 PM, AJ > wrote: > > On 7/3/2011 3:49 PM, Will Oberman wrote: >> Why not send the value itself instead of a placeholder? Now it >> takes 2x writes on a random node to do a single update (write >> placeholder, write update) and N*x writes from the client (write >> value, write placeholder to N-1). Where N is replication factor. >> Seems like extra network and IO instead of less... > > To send the value to each node is 1.) unnecessary, 2.) will only > cause a large burst of network traffic. Think about if it's a > large data value, such as a document. Just let C* do it's thing. > The extra messages are tiny and doesn't significantly increase > latency since they are all sent asynchronously. > > >> Of course, I still think this sounds like reimplementing >> Cassandra internals in a Cassandra client (just guessing, I'm not >> a cassandra dev) >> > > I don't see how. Maybe you should take a peek at the source. > > >> >> On Jul 3, 2011, at 5:20 PM, AJ > > wrote: >> >>> Yang, >>> >>> How would you deal with the problem when the 1st node responds >>> success but then crashes before completely forwarding any >>> replicas? Then, after switching to the next primary, a read >>> would return stale data. >>> >>> Here's a quick-n-dirty way: Send the value to the primary >>> replica and send placeholder values to the other replicas. The >>> placeholder value is something like, "PENDING_UPDATE". The >>> placeholder values are sent with timestamps 1 less than the >>> timestamp for the actual value that went to the primary. Later, >>> when the changes propagate, the actual values will overwrite the >>> placeholders. In event of a crash before the placeholder gets >>> overwritten, the next read value will tell the client so. The >>> client will report to the user that the key/column is >>> unavailable. The downside is you've overwritten your data and >>> maybe would like to know what the old data was! But, maybe >>> there's another way using other columns or with MVCC. The >>> client would want a success from the primary and the secondary >>> replicas to be certain of future read consistency in case the >>> primary goes down immediately as I said above. The ability to >>> set an "update_pending" flag on any column value would probably >>> make this work. But, I'll think more on this later. >>> >>> aj >>> >>> On 7/2/2011 10:55 AM, Yang wrote: >>>> there is a JIRA completed in 0.7.x that "Prefers" a certain >>>> node in snitch, so this does roughly what you want MOST of the >>>> time >>>> >>>> >>>> but the problem is that it does not GUARANTEE that the same >>>> node will always be read. I recently read into the HBase vs >>>> Cassandra comparison thread that started after Facebook dropped >>>> Cassandra for their messaging system, and understood some of >>>> the differences. what you want is essentially what HBase does. >>>> the fundamental difference there is really due to the gossip >>>> protocol: it's a probablistic, or eventually consistent failure >>>> detector while HBase/Google Bigtable use Zookeeper/Chubby to >>>> provide a strong failure detector (a distributed lock). so in >>>> HBase, if a tablet server goes down, it really goes down, it >>>> can not re-grab the tablet from the new tablet server without >>>> going through a start up protocol (notifying the master, which >>>> would notify the clients etc), in other words it is guaranteed >>>> that one tablet is served by only one tablet server at any >>>> given time. in comparison the above JIRA only TRYIES to serve >>>> that key from one particular replica. HBase can have that >>>> guarantee because the group membership is maintained by the >>>> strong failure detector. >>>> >>>> just for hacking curiosity, a strong failure detector + >>>> Cassandra replicas is not impossible (actually seems not >>>> difficult), although the performance is not clear. what would >>>> such a strong failure detector bring to Cassandra besides this >>>> ONE-ONE strong consistency ? that is an interesting question I >>>> think. >>>> >>>> considering that HBase has been deployed on big clusters, it is >>>> probably OK with the performance of the strong Zookeeper >>>> failure detector. then a further question was: why did Dynamo >>>> originally choose to use the probablistic failure detector? yes >>>> Dynamo's main theme is "eventually consistent", so the >>>> Phi-detector is **enough**, but if a strong detector buys us >>>> more with little cost, wouldn't that be great? >>>> >>>> >>>> >>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ >>> > wrote: >>>> >>>> Is this possible? >>>> >>>> All reads and writes for a given key will always go to the >>>> same node from a client. It seems the only thing needed is >>>> to allow the clients to compute which node is the closes >>>> replica for the given key using the same algorithm C* uses. >>>> When the first replica receives the write request, it will >>>> write to itself which should complete before any of the >>>> other replicas and then return. The loads should still >>>> stay balanced if using random partitioner. If the first >>>> replica becomes unavailable (however that is defined), then >>>> the clients can send to the next repilca in the ring and >>>> switch from ONE write/reads to QUORUM write/reads >>>> temporarily until the first replica becomes available >>>> again. QUORUM is required since there could be some >>>> replicas that were not updated after the first replica went >>>> down. >>>> >>>> Will this work? The goal is to have strong consistency >>>> with a read/write consistency level as low as possible >>>> while secondarily a network performance boost. >>>> >>>> >>> > > > > > -- > Will Oberman > Civic Science, Inc. > 3030 Penn Avenue., First Floor > Pittsburgh, PA 15201 > (M) 412-480-7835 > (E) oberman@civicscience.com --------------000801090003080609040805 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit On 7/3/2011 6:32 PM, William Oberman wrote:
Was just going off of: " Send the value to the primary replica and send placeholder values to the other replicas".  Sounded like you wanted to write the value to one, and write the placeholder to N-1 to me. 

Yes, that is what I was suggesting.  The point of the placeholders is to handle the crash case that I talked about... "like" a WAL does.

But, C* will propagate the value to N-1 eventually anyways, 'cause that's just what it does anyways :-)

will

On Sun, Jul 3, 2011 at 7:47 PM, AJ <aj@dude.podzone.net> wrote:
On 7/3/2011 3:49 PM, Will Oberman wrote:
Why not send the value itself instead of a placeholder?  Now it takes 2x writes on a random node to do a single update (write placeholder, write update) and N*x writes from the client (write value, write placeholder to N-1). Where N is replication factor.  Seems like extra network and IO instead of less...

To send the value to each node is 1.) unnecessary, 2.) will only cause a large burst of network traffic.  Think about if it's a large data value, such as a document.  Just let C* do it's thing.  The extra messages are tiny and doesn't significantly increase latency since they are all sent asynchronously.


 
Of course, I still think this sounds like reimplementing Cassandra internals in a Cassandra client (just guessing, I'm not a cassandra dev)  


I don't see how.  Maybe you should take a peek at the source.



On Jul 3, 2011, at 5:20 PM, AJ <aj@dude.podzone.net> wrote:

Yang,

How would you deal with the problem when the 1st node responds success but then crashes before completely forwarding any replicas?  Then, after switching to the next primary, a read would return stale data.

Here's a quick-n-dirty way:  Send the value to the primary replica and send placeholder values to the other replicas.  The placeholder value is something like, "PENDING_UPDATE".  The placeholder values are sent with timestamps 1 less than the timestamp for the actual value that went to the primary.  Later, when the changes propagate, the actual values will overwrite the placeholders.  In event of a crash before the placeholder gets overwritten, the next read value will tell the client so.  The client will report to the user that the key/column is unavailable.  The downside is you've overwritten your data and maybe would like to know what the old data was!  But, maybe there's another way using other columns or with MVCC.  The client would want a success from the primary and the secondary replicas to be certain of future read consistency in case the primary goes down immediately as I said above.  The ability to set an "update_pending" flag on any column value would probably make this work.  But, I'll think more on this later.

aj

On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a certain node in snitch, so this does roughly what you want MOST of the time


but the problem is that it does not GUARANTEE that the same node will always be read.  I recently read into the HBase vs Cassandra comparison thread that started after Facebook dropped Cassandra for their messaging system, and understood some of the differences. what you want is essentially what HBase does. the fundamental difference there is really due to the gossip protocol: it's a probablistic, or eventually consistent failure detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure detector (a distributed lock).  so in HBase, if a tablet server goes down, it really goes down, it can not re-grab the tablet from the new tablet server without going through a start up protocol (notifying the master, which would notify the clients etc),  in other words it is guaranteed that one tablet is served by only one tablet server at any given time.  in comparison the above JIRA only TRYIES to serve that key from one particular replica. HBase can have that guarantee because the group membership is maintained by the strong failure detector.

just for hacking curiosity, a strong failure detector + Cassandra replicas is not impossible (actually seems not difficult), although the performance is not clear. what would such a strong failure detector bring to Cassandra besides this ONE-ONE strong consistency ? that is an interesting question I think. 

considering that HBase has been deployed on big clusters, it is probably OK with the performance of the strong  Zookeeper failure detector. then a further question was: why did Dynamo originally choose to use the probablistic failure detector? yes Dynamo's main theme is "eventually consistent", so the Phi-detector is **enough**, but if a strong detector buys us more with little cost, wouldn't that  be great?



On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net> wrote:
Is this possible?

All reads and writes for a given key will always go to the same node from a client.  It seems the only thing needed is to allow the clients to compute which node is the closes replica for the given key using the same algorithm C* uses.  When the first replica receives the write request, it will write to itself which should complete before any of the other replicas and then return.  The loads should still stay balanced if using random partitioner.  If the first replica becomes unavailable (however that is defined), then the clients can send to the next repilca in the ring and switch from ONE write/reads to QUORUM write/reads temporarily until the first replica becomes available again.  QUORUM is required since there could be some replicas that were not updated after the first replica went down.

Will this work?  The goal is to have strong consistency with a read/write consistency level as low as possible while secondarily a network performance boost.






--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

--------------000801090003080609040805--