cassandra-user mailing list archives

From AJ <...@dude.podzone.net>
Subject Re: Strong Consistency with ONE read/writes
Date Sun, 03 Jul 2011 23:52:17 GMT
On 7/3/2011 4:07 PM, Yang wrote:
>
> I'm no expert, so addressing the question to me probably won't give you
> real answers :)
>
> The single entry mode makes sure that all writes coming through the
> leader are received by the replicas before the ack to the client, so there
> probably won't be stale data.
>

That doesn't sound any different from a TWO write.  I'm trying to save a
hop (+1 data xfer) by ack'ing immediately after the primary
successfully writes, i.e., a ONE write.
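The difference comes down to the consistency level the client attaches to the write. A minimal sketch with the DataStax Java driver (a later-era client used purely for illustration; the contact point, keyspace, and table names are made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class OneVsTwoWrite {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("my_ks");  // keyspace name is made up

            // ONE: the coordinator acks as soon as a single replica has written,
            // which saves the extra replica ack that TWO waits for.
            session.execute(new SimpleStatement(
                    "UPDATE users SET email = 'a@b.c' WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.ONE));

            // TWO: the coordinator waits for two replicas before acking.
            session.execute(new SimpleStatement(
                    "UPDATE users SET email = 'a@b.c' WHERE id = 42")
                    .setConsistencyLevel(ConsistencyLevel.TWO));

            cluster.close();
        }
    }

Neither level by itself gives the "ack from the primary only, but still read-consistent" behavior being discussed here; that gap is what the rest of the thread is about.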

> On Jul 3, 2011 11:20 AM, "AJ" <aj@dude.podzone.net> wrote:
> > Yang,
> >
> > How would you deal with the problem where the 1st node responds with
> > success but then crashes before forwarding the write to any of the
> > replicas? Then, after switching to the next primary, a read would
> > return stale data.
> >
> > Here's a quick-and-dirty way: Send the value to the primary replica and
> > send placeholder values to the other replicas. The placeholder value is
> > something like "PENDING_UPDATE". The placeholder values are sent with
> > timestamps 1 less than the timestamp of the actual value that went to
> > the primary. Later, when the changes propagate, the actual values will
> > overwrite the placeholders. In the event of a crash before a placeholder
> > gets overwritten, the next read will return the placeholder, so the
> > client can report to the user that the key/column is unavailable. The
> > downside is that you've overwritten your data and maybe would like to
> > know what the old data was! But maybe there's another way, using other
> > columns or MVCC. The client would want a success from the primary
> > and the secondary replicas to be certain of future read consistency in
> > case the primary goes down immediately, as I said above. The ability to
> > set an "update_pending" flag on any column value would probably make
> > this work. But I'll think more on this later.
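A minimal sketch of that placeholder scheme, assuming a hypothetical per-replica client that exposes explicit write timestamps (the stock clients route through a coordinator and do not expose this directly, so treat this as pseudocode for the idea rather than something you can run against a cluster as-is):

    import java.util.List;

    // Hypothetical interface: write/read one column on one specific replica,
    // with an explicit timestamp in microseconds.
    interface ReplicaClient {
        void write(String replica, String key, String column, String value, long tsMicros);
        String read(String replica, String key, String column);
    }

    class PlaceholderWriter {
        static final String PENDING = "PENDING_UPDATE";
        private final ReplicaClient client;

        PlaceholderWriter(ReplicaClient client) { this.client = client; }

        void write(String primary, List<String> secondaries,
                   String key, String column, String value) {
            long ts = System.currentTimeMillis() * 1000L;
            // Placeholders get timestamp ts - 1 so the real value, once it
            // propagates from the primary, always wins the timestamp race.
            for (String replica : secondaries) {
                client.write(replica, key, column, PENDING, ts - 1);
            }
            client.write(primary, key, column, value, ts);
        }

        // Returns null to mean "key/column currently unavailable": the primary
        // died before its value reached the replica we are reading from.
        String read(String replica, String key, String column) {
            String v = client.read(replica, key, column);
            return PENDING.equals(v) ? null : v;
        }
    }

The downside noted above still applies: the placeholder clobbers the previous value on the secondaries, so the old data is gone unless it is stashed somewhere else first.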
> >
> > aj
> >
> > On 7/2/2011 10:55 AM, Yang wrote:
> >> there is a JIRA completed in 0.7.x that "prefers" a certain node in
> >> the snitch, so this does roughly what you want MOST of the time.
> >>
> >>
> >> But the problem is that it does not GUARANTEE that the same node will
> >> always be read. I recently read through the HBase vs. Cassandra
> >> comparison thread that started after Facebook dropped Cassandra for
> >> their messaging system, and understood some of the differences. What
> >> you want is essentially what HBase does. The fundamental difference
> >> there is really due to the gossip protocol: it's a probabilistic, or
> >> eventually consistent, failure detector, while HBase/Google Bigtable
> >> use Zookeeper/Chubby to provide a strong failure detector (a
> >> distributed lock). So in HBase, if a tablet server goes down, it
> >> really goes down; it cannot re-grab the tablet from the new tablet
> >> server without going through a start-up protocol (notifying the
> >> master, which would notify the clients, etc.). In other words, it is
> >> guaranteed that one tablet is served by only one tablet server at any
> >> given time. In comparison, the above JIRA only TRIES to serve that
> >> key from one particular replica. HBase can have that guarantee because
> >> the group membership is maintained by the strong failure detector.
> >>
> >> Just out of hacking curiosity, a strong failure detector + Cassandra
> >> replicas is not impossible (it actually seems not difficult), although
> >> the performance is not clear. What would such a strong failure
> >> detector bring to Cassandra besides this ONE-ONE strong consistency?
> >> That is an interesting question, I think.
> >>
> >> Considering that HBase has been deployed on big clusters, it is
> >> probably OK with the performance of the strong Zookeeper failure
> >> detector. Then a further question is: why did Dynamo originally
> >> choose the probabilistic failure detector? Yes, Dynamo's main
> >> theme is "eventually consistent", so the Phi detector is **enough**,
> >> but if a strong detector buys us more at little cost, wouldn't that
> >> be great?
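For what it's worth, the "strong failure detector" described above is roughly what ZooKeeper's ephemeral znodes provide. A minimal sketch (the connect string, path, and address are made up, and the parent znode is assumed to already exist):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class StrongFailureDetectorSketch {
        public static void main(String[] args) throws Exception {
            // Each replica registers an ephemeral znode at startup. If the
            // replica's ZooKeeper session dies, the znode disappears, so
            // membership is a lock-like signal rather than a gossip-based guess.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 5000, null);
            zk.create("/replicas/10.0.0.1",
                      new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);

            // Any other node or client gets a strong answer about liveness:
            boolean alive = zk.exists("/replicas/10.0.0.1", false) != null;
            System.out.println("replica alive: " + alive);
        }
    }

This is the mechanism HBase leans on; whether bolting it onto Cassandra is worth the extra moving part is exactly the open question above.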
> >>
> >>
> >>
> >> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net> wrote:
> >>
> >> Is this possible?
> >>
> >> All reads and writes for a given key will always go to the same
> >> node from a client. It seems the only thing needed is to allow
> >> the clients to compute which node is the closest replica for a
> >> given key, using the same algorithm C* uses. When the first
> >> replica receives the write request, it will write to itself, which
> >> should complete before any of the other replicas, and then return.
> >> The load should still stay balanced if using the random partitioner.
> >> If the first replica becomes unavailable (however that is
> >> defined), then the clients can send to the next replica in the
> >> ring and switch from ONE writes/reads to QUORUM writes/reads
> >> temporarily until the first replica becomes available again.
> >> QUORUM is required since there could be some replicas that were
> >> not yet updated when the first replica went down.
> >>
> >> Will this work? The goal is to have strong consistency with a
> >> read/write consistency level as low as possible, with a network
> >> performance boost as a secondary goal.
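A sketch of the client-side replica computation being proposed, assuming the random partitioner (MD5 tokens) and SimpleStrategy-style placement; the ring contents and the fallback policy are illustrative, not a drop-in implementation:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    class PrimaryReplicaLocator {
        // ring: node token -> node address, as the client has learned it
        private final TreeMap<BigInteger, String> ring = new TreeMap<>();

        void addNode(BigInteger token, String address) { ring.put(token, address); }

        // RandomPartitioner-style token: MD5 of the key as a non-negative integer.
        static BigInteger token(String key) throws Exception {
            byte[] md5 = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
            return new BigInteger(md5).abs();
        }

        // SimpleStrategy-style placement: the primary replica is the first node
        // at or after the key's token, wrapping around the ring.
        String primaryFor(String key) throws Exception {
            BigInteger t = token(key);
            SortedMap<BigInteger, String> tail = ring.tailMap(t);
            return tail.isEmpty() ? ring.firstEntry().getValue()
                                  : tail.get(tail.firstKey());
        }
    }

If primaryFor(key) is on the client's down list, the client would move to the next node around the ring and switch to QUORUM reads/writes until the primary is back, exactly as described above.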
> >>
> >>
> >

