incubator-cassandra-user mailing list archives

From Oleg Anastasyev <olega...@gmail.com>
Subject Re: New Chain for : Does Cassandra use vector clocks
Date Fri, 25 Feb 2011 09:22:30 GMT
Sylvain Lebresne <sylvain <at> datastax.com> writes:

> However, if that simple conflict detection/resolution mechanism is not good
> enough for some of your use case and you need to keep two concurrent updates, it
> is easy enough. Just make sure that the update don't end up in the same column.
> This is easily achieved by appending some unique identifier to the column name
> for instance. And when reading, do a slice and reconcile whatever you get back
> with whatever logic make sense. If you do that, congrats, you've roughly
> emulated what vector clocks would do. Btw, no locking or anything needed.

This solution is (much?) worse than having vector clocks. It multiplies the
amount of data and the load on your system, forcing you to add more nodes to
the cluster, because:
* The number of columns at least doubles. It gets even worse if you cannot
predict how many processes will access the same column simultaneously, because
you then need to add a unique suffix to the column name on every update, which
effectively turns updates into inserts. If you have a dataset that is updated
often, you multiply the number of columns, and so the data size, by the number
of updates to your dataset.
* These columns with unique suffixes need to be merged somehow. Cassandra has a
nice background merge facility, compaction, but it cannot help with such a
dataset because there is nothing to compact: every column is unique and has no
overwritten generation.
* So the merge must be done anyway, because logically this is still a single
column. The only way is to read all columns with a given prefix using a
get_slice call and resolve conflicts manually, returning the freshest copy to
the client and deleting obsolete data. This makes the application code complex,
puts additional load on the Cassandra cluster (it must now do read repair (RR)
for several columns instead of one), and triggers additional operations
(deletes of obsolete values).
* And finally, deleting obsolete data does not actually free space until
GCGraceSeconds has passed. So your disks fill up, storing obsolete data for a
prolonged time.
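
To make the cost concrete, here is a rough sketch of the suffix-and-reconcile
pattern discussed above. It is only an in-memory simulation: the row is a plain
Python dict, and the names write_version and read_and_reconcile are made up for
illustration, not part of any Cassandra client API. Picking the highest
timestamp as the winner is likewise just one possible reconcile rule.

```python
import uuid

def write_version(row, base_name, value, timestamp):
    """Store an update under a unique column name instead of overwriting.

    `row` is a dict standing in for one row: column name -> (value, timestamp).
    """
    column = f"{base_name}:{uuid.uuid4().hex}"  # unique suffix per update
    row[column] = (value, timestamp)
    return column

def read_and_reconcile(row, base_name):
    """Slice every column with the prefix, keep the freshest, delete the rest."""
    prefix = base_name + ":"
    versions = {c: v for c, v in row.items() if c.startswith(prefix)}
    if not versions:
        return None
    # Assumed reconcile rule: the highest timestamp wins.
    winner = max(versions, key=lambda c: versions[c][1])
    for column in versions:
        if column != winner:
            # In a real cluster each of these deletes would leave a tombstone.
            del row[column]
    return versions[winner][0]

row = {}
write_version(row, "email", "a@example.com", 1)
write_version(row, "email", "b@example.com", 3)
write_version(row, "email", "c@example.com", 2)
columns_before = len(row)            # one column per update, not one overall
freshest = read_and_reconcile(row, "email")
columns_after = len(row)             # only the winner survives the cleanup
```

Every update adds a column, and every read has to scan the whole prefix and
issue deletes for the losers, which is the extra storage and load described in
the list above.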

In contrast, vector clocks are a more efficient solution. They do not duplicate
column names and values several times; they duplicate only the timestamp, once
per replica (that is, RF times). And your logically single column is handled as
a single column.
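
For comparison, a minimal sketch of how a vector clock tells an overwrite from a
concurrent update. It assumes a clock is represented as a dict mapping a replica
id to a counter; the compare function and the replica names n1 and n2 are
hypothetical, not taken from Cassandra.

```python
def compare(a, b):
    """Compare two vector clocks given as dicts of replica id -> counter.

    Returns "before", "after", "equal" or "concurrent" (a relative to b).
    A replica missing from a clock is treated as counter 0.
    """
    replicas = set(a) | set(b)
    a_ahead = any(a.get(r, 0) > b.get(r, 0) for r in replicas)
    b_ahead = any(b.get(r, 0) > a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "concurrent"  # neither update saw the other: a real conflict
    if a_ahead:
        return "after"       # a supersedes b, so b can simply be dropped
    if b_ahead:
        return "before"      # b supersedes a
    return "equal"

overwrite = compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1})
conflict = compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2})
```

Only the "concurrent" case needs application-level reconciliation; in the
other cases the column stays a single column with a single value.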
