incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From William Oberman <ober...@civicscience.com>
Subject Re: Strong Consistency with ONE read/writes
Date Mon, 04 Jul 2011 02:28:04 GMT
I'm using cassandra as a tool, like a black box with a certain contract to
the world.  Without modifying the "core", C* will send the updates to all
replicas, so your plan would cause the extra write (for the placeholder).  I
wasn't assuming a modification to how C* fundamentally works.

Sounds like you are hacking (or at least looking) at the source, so all the
power to you if/when you try these kind of changes.

will

On Sun, Jul 3, 2011 at 8:45 PM, AJ <aj@dude.podzone.net> wrote:

> **
> On 7/3/2011 6:32 PM, William Oberman wrote:
>
> Was just going off of: " Send the value to the primary replica and send
> placeholder values to the other replicas".  Sounded like you wanted to write
> the value to one, and write the placeholder to N-1 to me.
>
>
> Yes, that is what I was suggesting.  The point of the placeholders is to
> handle the crash case that I talked about... "like" a WAL does.
>
>
> But, C* will propagate the value to N-1 eventually anyways, 'cause that's
> just what it does anyways :-)
>
>  will
>
> On Sun, Jul 3, 2011 at 7:47 PM, AJ <aj@dude.podzone.net> wrote:
>
>>  On 7/3/2011 3:49 PM, Will Oberman wrote:
>>
>> Why not send the value itself instead of a placeholder?  Now it takes 2x
>> writes on a random node to do a single update (write placeholder, write
>> update) and N*x writes from the client (write value, write placeholder to
>> N-1). Where N is replication factor.  Seems like extra network and IO
>> instead of less...
>>
>>
>>  To send the value to each node is 1.) unnecessary, 2.) will only cause a
>> large burst of network traffic.  Think about if it's a large data value,
>> such as a document.  Just let C* do it's thing.  The extra messages are tiny
>> and doesn't significantly increase latency since they are all sent
>> asynchronously.
>>
>>
>>
>> Of course, I still think this sounds like reimplementing Cassandra
>> internals in a Cassandra client (just guessing, I'm not a cassandra dev)
>>
>>
>>
>>  I don't see how.  Maybe you should take a peek at the source.
>>
>>
>>
>> On Jul 3, 2011, at 5:20 PM, AJ <aj@dude.podzone.net> wrote:
>>
>>   Yang,
>>
>> How would you deal with the problem when the 1st node responds success but
>> then crashes before completely forwarding any replicas?  Then, after
>> switching to the next primary, a read would return stale data.
>>
>> Here's a quick-n-dirty way:  Send the value to the primary replica and
>> send placeholder values to the other replicas.  The placeholder value is
>> something like, "PENDING_UPDATE".  The placeholder values are sent with
>> timestamps 1 less than the timestamp for the actual value that went to the
>> primary.  Later, when the changes propagate, the actual values will
>> overwrite the placeholders.  In event of a crash before the placeholder gets
>> overwritten, the next read value will tell the client so.  The client will
>> report to the user that the key/column is unavailable.  The downside is
>> you've overwritten your data and maybe would like to know what the old data
>> was!  But, maybe there's another way using other columns or with MVCC.  The
>> client would want a success from the primary and the secondary replicas to
>> be certain of future read consistency in case the primary goes down
>> immediately as I said above.  The ability to set an "update_pending" flag on
>> any column value would probably make this work.  But, I'll think more on
>> this later.
>>
>> aj
>>
>> On 7/2/2011 10:55 AM, Yang wrote:
>>
>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>> snitch, so this does roughly what you want MOST of the time
>>
>>
>>  but the problem is that it does not GUARANTEE that the same node will
>> always be read.  I recently read into the HBase vs Cassandra comparison
>> thread that started after Facebook dropped Cassandra for their messaging
>> system, and understood some of the differences. what you want is essentially
>> what HBase does. the fundamental difference there is really due to the
>> gossip protocol: it's a probablistic, or eventually consistent failure
>> detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a
>> strong failure detector (a distributed lock).  so in HBase, if a tablet
>> server goes down, it really goes down, it can not re-grab the tablet from
>> the new tablet server without going through a start up protocol (notifying
>> the master, which would notify the clients etc),  in other words it is
>> guaranteed that one tablet is served by only one tablet server at any given
>> time.  in comparison the above JIRA only TRYIES to serve that key from one
>> particular replica. HBase can have that guarantee because the group
>> membership is maintained by the strong failure detector.
>>
>>  just for hacking curiosity, a strong failure detector + Cassandra
>> replicas is not impossible (actually seems not difficult), although the
>> performance is not clear. what would such a strong failure detector bring to
>> Cassandra besides this ONE-ONE strong consistency ? that is an interesting
>> question I think.
>>
>>  considering that HBase has been deployed on big clusters, it is probably
>> OK with the performance of the strong  Zookeeper failure detector. then a
>> further question was: why did Dynamo originally choose to use the
>> probablistic failure detector? yes Dynamo's main theme is "eventually
>> consistent", so the Phi-detector is **enough**, but if a strong detector
>> buys us more with little cost, wouldn't that  be great?
>>
>>
>>
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net> wrote:
>>
>>> Is this possible?
>>>
>>> All reads and writes for a given key will always go to the same node from
>>> a client.  It seems the only thing needed is to allow the clients to compute
>>> which node is the closes replica for the given key using the same algorithm
>>> C* uses.  When the first replica receives the write request, it will write
>>> to itself which should complete before any of the other replicas and then
>>> return.  The loads should still stay balanced if using random partitioner.
>>>  If the first replica becomes unavailable (however that is defined), then
>>> the clients can send to the next repilca in the ring and switch from ONE
>>> write/reads to QUORUM write/reads temporarily until the first replica
>>> becomes available again.  QUORUM is required since there could be some
>>> replicas that were not updated after the first replica went down.
>>>
>>> Will this work?  The goal is to have strong consistency with a read/write
>>> consistency level as low as possible while secondarily a network performance
>>> boost.
>>>
>>
>>
>>
>>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) oberman@civicscience.com
>
>
>


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) oberman@civicscience.com

Mime
View raw message