How would you deal with the problem where the first node responds
success but then crashes before forwarding the write to any of the
other replicas? Then, after switching to the next primary, a read
would return stale data.
Here's a quick-n-dirty way: Send the value to the primary
replica and send placeholder values to the other replicas.
The placeholder value is something like, "PENDING_UPDATE".
The placeholder values are sent with timestamps 1 less than
the timestamp for the actual value that went to the primary.
Later, when the changes propagate normally, the actual values will
overwrite the placeholders. In the event of a crash before a
placeholder gets overwritten, the next read will return the
placeholder, telling the client what happened, and the client will
report to the user that the key/column is unavailable. The downside
is that you've overwritten your old data and may well want to know
what it was!
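To make that concrete, here's a rough sketch in Python. The
client.write/client.read calls are stand-ins for whatever Thrift
client you're using, not a real API:

    import time

    PLACEHOLDER = "PENDING_UPDATE"

    def write_with_placeholders(client, primary, secondaries,
                                key, column, value):
        # Timestamp for the real value; Cassandra keeps the column
        # with the highest timestamp on conflict.
        ts = int(time.time() * 1_000_000)  # microseconds

        # Placeholders get a timestamp one less than the real value,
        # so the normally replicated write overwrites them when it
        # eventually arrives.
        for node in secondaries:
            client.write(node, key, column, PLACEHOLDER,
                         timestamp=ts - 1)

        # The real value goes to the primary replica.
        client.write(primary, key, column, value, timestamp=ts)

    def read(client, node, key, column):
        val = client.read(node, key, column)
        if val == PLACEHOLDER:
            # Primary crashed before the real value propagated; the
            # old data is gone, so all we can report is "unavailable".
            raise KeyError("%s/%s: pending update was lost" % (key, column))
        return val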
But maybe there's another way, using other columns or MVCC. The
client would want a success from both the primary and the secondary
replicas to be certain of future read consistency in case the
primary goes down immediately, as I said above. The ability to set
an "update_pending" flag on any column value would probably make
this work. But I'll think more on this later.
On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that
"Prefers" a certain node in snitch, so this does roughly
what you want MOST of the time
but the problem is that it does not GUARANTEE that the
same node will always be read. I recently read through the
HBase vs Cassandra comparison thread that started after
Facebook dropped Cassandra for their messaging system, and
understood some of the differences. what you want is
essentially what HBase does. the fundamental difference
there is really due to the gossip protocol: it's a
probabilistic, or eventually consistent, failure detector,
while HBase/Google Bigtable use Zookeeper/Chubby to
provide a strong failure detector (a distributed lock).
so in HBase, if a tablet server goes down, it really goes
down: it cannot re-grab the tablet from the new tablet
server without going through a start-up protocol
(notifying the master, which would notify the clients,
etc). in other words it is guaranteed that one tablet is
served by only one tablet server at any given time. in
comparison the above JIRA only TRIES to serve that key
from one particular replica. HBase can have that guarantee
because the group membership is maintained by the strong
failure detector (Zookeeper).
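for the curious, the "strong failure detector" here amounts to a
lease on an ephemeral znode: a server owns a tablet only while its
Zookeeper session is alive. a minimal sketch with the kazoo Python
client (the /tablets layout is made up for illustration; this is not
HBase's actual protocol):

    from kazoo.client import KazooClient
    from kazoo.exceptions import NodeExistsError

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()
    zk.ensure_path("/tablets")

    def try_own_tablet(tablet_id, server_id):
        # EPHEMERAL: the znode exists only while this server's
        # Zookeeper session is alive. If the server crashes or is
        # partitioned past its session timeout, the znode vanishes
        # and another server can claim the tablet -- so at most one
        # server serves a tablet at any given time.
        try:
            zk.create("/tablets/" + tablet_id, server_id.encode(),
                      ephemeral=True)
            return True
        except NodeExistsError:
            return False  # some live server already owns it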
just for hacking curiosity, a strong failure detector +
Cassandra replicas is not impossible (actually it seems not
difficult), although the performance is not clear. what
would such a strong failure detector bring to Cassandra
besides this ONE-ONE strong consistency? that is an
interesting question I think.
considering that HBase has been deployed on big
clusters, it is probably OK with the performance of the
strong Zookeeper failure detector. then a further
question: why did Dynamo originally choose the
probabilistic failure detector? yes Dynamo's main theme is
"eventually consistent", so the Phi-detector is
**enough**, but if a strong detector buys us more at
little cost, wouldn't that be great?
On Fri, Jul 1, 2011 at 6:53 PM,
Is this possible?
All reads and writes for a given key will always go to
the same node from a client. It seems the only thing
needed is to allow the clients to compute which node
is the closest replica for the given key, using the same
algorithm C* uses. When the first replica receives
the write request, it will write to itself, which
should complete before any of the other replicas, and
then return. The load should still stay balanced if
using the random partitioner. If the first replica
becomes unavailable (however that is defined), then
the clients can send to the next replica in the ring
and switch from ONE writes/reads to QUORUM writes/reads
temporarily, until the first replica becomes available
again. QUORUM is required since there could be some
replicas that were not updated after the first replica
went down; with QUORUM on both reads and writes, any
read set overlaps any write set in at least one replica.
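A rough sketch of what the client-side computation could look like,
assuming RandomPartitioner (MD5 tokens) and a SimpleStrategy-style
walk of the ring; client.write and NodeUnavailable are hypothetical
stand-ins for a real client library:

    import hashlib
    from bisect import bisect_left

    class NodeUnavailable(Exception):
        pass  # stand-in for the client's failure signal

    def key_token(key):
        # RandomPartitioner derives a token by MD5-hashing the key
        # and taking the absolute value of the signed 128-bit result.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return abs(int.from_bytes(digest, "big", signed=True))

    def replicas_for(key, ring, rf=3):
        # ring: list of (token, node) pairs sorted by token, e.g.
        # built from describe_ring. Returns rf replicas in preference
        # order by walking the ring from the key's token.
        tokens = [t for t, _ in ring]
        i = bisect_left(tokens, key_token(key)) % len(ring)
        return [ring[(i + k) % len(ring)][1] for k in range(rf)]

    def write(client, key, value, ring):
        primary, *rest = replicas_for(key, ring)
        try:
            # Normal case: every client computes the same first
            # replica, so ONE is enough for consistent reads.
            client.write(primary, key, value, consistency="ONE")
        except NodeUnavailable:
            # First replica down: go through another replica at
            # QUORUM until the first one is back, so reads and
            # writes still overlap in at least one replica.
            client.write(rest[0], key, value, consistency="QUORUM")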
Will this work? The goal is to have strong
consistency with a read/write consistency level as low
as possible, while secondarily gaining some network
performance.