Why not send the value itself instead of a placeholder? Now it takes
2x writes on a random node to do a single update (write placeholder,
write update) and N*x writes from the client (write value, write
placeholder to N-1). Where N is replication factor. Seems like extra
network and IO instead of less...
Of course, I still think this sounds like reimplementing Cassandra
internals in a Cassandra client (just guessing, I'm not a cassandra dev)
On Jul 3, 2011, at 5:20 PM, AJ <aj@dude.podzone.net> wrote:
> Yang,
>
> How would you deal with the problem when the 1st node responds
> success but then crashes before completely forwarding any replicas?
> Then, after switching to the next primary, a read would return stale
> data.
>
> Here's a quick-n-dirty way: Send the value to the primary replica
> and send placeholder values to the other replicas. The placeholder
> value is something like, "PENDING_UPDATE". The placeholder values
> are sent with timestamps 1 less than the timestamp for the actual
> value that went to the primary. Later, when the changes propagate,
> the actual values will overwrite the placeholders. In event of a
> crash before the placeholder gets overwritten, the next read value
> will tell the client so. The client will report to the user that
> the key/column is unavailable. The downside is you've overwritten
> your data and maybe would like to know what the old data was! But,
> maybe there's another way using other columns or with MVCC. The
> client would want a success from the primary and the secondary
> replicas to be certain of future read consistency in case the
> primary goes down immediately as I said above. The ability to set
> an "update_pending" flag on any column value would probably make
> this work. But, I'll think more on this later.
>
> aj
>
> On 7/2/2011 10:55 AM, Yang wrote:
>>
>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>> snitch, so this does roughly what you want MOST of the time
>>
>>
>> but the problem is that it does not GUARANTEE that the same node
>> will always be read. I recently read into the HBase vs Cassandra
>> comparison thread that started after Facebook dropped Cassandra for
>> their messaging system, and understood some of the differences.
>> what you want is essentially what HBase does. the fundamental
>> difference there is really due to the gossip protocol: it's a
>> probablistic, or eventually consistent failure detector while
>> HBase/Google Bigtable use Zookeeper/Chubby to provide a strong
>> failure detector (a distributed lock). so in HBase, if a tablet
>> server goes down, it really goes down, it can not re-grab the
>> tablet from the new tablet server without going through a start up
>> protocol (notifying the master, which would notify the clients
>> etc), in other words it is guaranteed that one tablet is served by
>> only one tablet server at any given time. in comparison the above
>> JIRA only TRYIES to serve that key from one particular replica.
>> HBase can have that guarantee because the group membership is
>> maintained by the strong failure detector.
>>
>> just for hacking curiosity, a strong failure detector + Cassandra
>> replicas is not impossible (actually seems not difficult), although
>> the performance is not clear. what would such a strong failure
>> detector bring to Cassandra besides this ONE-ONE strong
>> consistency ? that is an interesting question I think.
>>
>> considering that HBase has been deployed on big clusters, it is
>> probably OK with the performance of the strong Zookeeper failure
>> detector. then a further question was: why did Dynamo originally
>> choose to use the probablistic failure detector? yes Dynamo's main
>> theme is "eventually consistent", so the Phi-detector is
>> **enough**, but if a strong detector buys us more with little cost,
>> wouldn't that be great?
>>
>>
>>
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <aj@dude.podzone.net> wrote:
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same
>> node from a client. It seems the only thing needed is to allow the
>> clients to compute which node is the closes replica for the given
>> key using the same algorithm C* uses. When the first replica
>> receives the write request, it will write to itself which should
>> complete before any of the other replicas and then return. The
>> loads should still stay balanced if using random partitioner. If
>> the first replica becomes unavailable (however that is defined),
>> then the clients can send to the next repilca in the
>> ring and switch from ONE write/reads to QUORUM write/reads
>> temporarily until the first replica becomes available again.
>> QUORUM is required since there could be some replicas that were not
>> updated after the first replica went down.
>>
>> Will this work? The goal is to have strong consistency with a read/
>> write consistency level as low as possible while secondarily a
>> network performance boost.
>>
>
|