incubator-cassandra-user mailing list archives

From Peter Schuller <peter.schul...@infidyne.com>
Subject Re: What happens if there is a collision?
Date Tue, 26 Oct 2010 19:17:45 GMT
> I may have been unclear about the meaning of timestamp in Cassandra. I was
> under the impression that any given data with the same key value and two
> different timestamps would result in two 'rows'. From what you say, it does
> not seem to be the case. Do you confirm? (In other words, whoever has the
> greatest timestamp destroys the previous records with lower timestamps).

Yes (other than the use of the word "row"). An "insert" of a column (a
column being essentially a key/value pair) causes the key to be
associated with that value. If there was already a column with the
same key, it is replaced, provided the new write carries the higher
timestamp. If there was no such column, one is added.
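
To make the replacement rule concrete, here is a minimal sketch in
Python. It is not Cassandra code: the Column class and the insert()
helper are hypothetical names made up for illustration, and timestamp
ties are deliberately left aside.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical stand-in for a column: a name, a value and a timestamp."""
    name: str
    value: bytes
    timestamp: int


def insert(row: dict, col: Column) -> None:
    """Add the column, or replace an existing one only if the new one is newer."""
    existing = row.get(col.name)
    if existing is None or col.timestamp > existing.timestamp:
        row[col.name] = col


row = {}
insert(row, Column("email", b"old@example.com", 100))
insert(row, Column("email", b"new@example.com", 200))
# A write that arrives late but carries a lower timestamp does not win.
insert(row, Column("email", b"stale@example.com", 150))
```

There is only ever one "email" column in the row; the write with
timestamp 200 is the one that remains.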

If you have a situation where conflicting writes cannot be allowed,
you'll either have to have some strong co-ordination of writers
outside of Cassandra, or else "serialize" the problem: write the
intended changes to some kind of queue/data structure, and have a
single Cassandra client that is guaranteed to be alone process them
in batch mode (thereby avoiding the need for co-ordination).
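
A rough sketch of the "serialize" approach, in Python. Everything here
is an assumption made for illustration: the in-memory deque stands in
for whatever queue/data structure is actually used, the dict stands in
for the data in Cassandra, and the "first request wins" policy is just
one possible rule.

```python
from collections import deque

# Hypothetical queue of intended changes; any writer may append to it.
pending = deque()


def submit(client, key, value):
    """Record an intended change instead of writing it directly."""
    pending.append((client, key, value))


def process_batch(store):
    """Run by the single, guaranteed-to-be-alone processor.

    Changes are applied one at a time in queue order, so two requests
    for the same key can never race; the later one is simply rejected.
    """
    rejected = []
    while pending:
        client, key, value = pending.popleft()
        if key in store:
            rejected.append((client, key))
        else:
            store[key] = value
    return rejected


store = {}
submit("A", "user:bob", "claimed by A")
submit("B", "user:bob", "claimed by B")
rejected = process_batch(store)
```

Because only one process ever applies the changes, the conflict is
decided in one place and no co-ordination between A and B is needed.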

> I know I am boxing a corner case, but I have not seen in the documentation
> that latest timestamp erases/overwrittes previous data. Now, I may have
> missed something here. May be I did not rub my eyes enough or the coffee was
> not operating yet.

I'm not sure where it's most clearly stated and I don't remember how I
figured these things out originally. I think the closest thing on the
wiki would be:

  http://wiki.apache.org/cassandra/DataModel

It does mention that timestamps are used for conflict resolution but
does not really dwell on the issue, and the remainder elides
timestamps. So perhaps it's easy to miss. I also notice that the
phrasing is such that it is not entirely unreasonable to interpret it
the way you seem to have.

At the same time that page is somewhat of a mix between internal
models and the model exposed to clients, so I'm not sure how best to
improve the phrasing.

Riptano's recently added documentation may be worth reading:

   http://www.riptano.com/docs/0.6.5/index

Though upon cursory examination I'm not sure whether it is more clear
on this particular point.

> i) That most recent timestamp overwrittes previous entries with lower
> timestamp.

This can definitely be clarified.

> ii) If case of timestamp ties, value breaks ties.

This should be documented as well, if it is indeed intended to be a
guarantee and not an artifact of the current implementation (anyone
want to comment - jbellis?).
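
For illustration, a tie-breaking rule of this kind could look like the
Python sketch below. The reconcile() function is hypothetical, and the
direction of the tie-break (the larger value wins on byte comparison)
is an assumption; whether Cassandra guarantees any such rule is exactly
the open question above.

```python
def reconcile(a, b):
    """Pick the winner between two versions of the same column.

    Each version is a (value, timestamp) tuple. The higher timestamp
    wins; on a tie the values are compared so that the result does not
    depend on the order in which the two versions are seen.
    """
    if a[1] != b[1]:
        return a if a[1] > b[1] else b
    # Assumed tie-break: the larger value (byte-wise comparison) wins.
    return a if a[0] >= b[0] else b


first = (b"apple", 1000)
second = (b"zebra", 1000)
winner = reconcile(first, second)
same_winner = reconcile(second, first)
```

The useful property is that the outcome is deterministic: every node
picks the same version no matter which write it happened to see first.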

> iii) What about ColumnFamilies and SuperColumnFamilies? Do we have the
> guarantee that, in case of timestamp  ties, the whole record of the winner
> is register (I would assume yes, of course)

Individual columns may be inserted into a SuperColumn so it is not
inserted as one compound value. If writers A and B both do concurrent
insertions to a SuperColumn where A writes column C1 and B writes
column C1 and C2, B's write of C2 will always stick, but C1 will be
subject to individual column conflict resolution. Keep in mind however
that typically timestamps are not allocated/chosen on a per-column
basis by a client. It does occur to me that at this point you may
actually have issues with the timestamp tie and value based conflict
resolution if you are expecting a set of column updates to either
apply or not apply as a group (with respect to some other group of
updates). That's a bit subtle.
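
The subtlety can be shown with a small Python sketch of the A/B
scenario. The apply() helper and the column values are made up, and the
value-based tie-break (larger value wins) is an assumption used only to
make the example concrete.

```python
def apply(super_column, updates, timestamp):
    """Apply a group of column updates that share one client timestamp.

    Each column is resolved on its own: the higher timestamp wins, and
    on a tie the larger value wins (assumed tie-break).
    """
    for name, value in updates.items():
        current = super_column.get(name)
        if current is None or (timestamp, value) > (current[1], current[0]):
            super_column[name] = (value, timestamp)


sc = {}
# Writer A updates C1; writer B updates C1 and C2 with the same timestamp.
apply(sc, {"C1": b"zebra"}, 1000)                 # writer A
apply(sc, {"C1": b"apple", "C2": b"pear"}, 1000)  # writer B

c1_value = sc["C1"][0]
c2_value = sc["C2"][0]
```

Here C2 comes from B while C1 comes from A, so B's update has been
applied only in part. Neither writer's group of columns "won" as a
whole.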

Also on the topic of granularity, entire super columns and entire rows
may be deleted without individually referring to all columns. In those
cases, deletes span entire rows or supercolumns rather than individual
columns.
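
As a sketch of what a row-level delete means for the columns under it,
assume the delete is recorded as a single timestamp on the row, and
that it hides every column written at or before that timestamp. The
Row class below is hypothetical Python, not Cassandra's implementation.

```python
class Row:
    """Hypothetical row: columns plus one optional row-level delete marker."""

    def __init__(self):
        self.columns = {}       # name -> (value, timestamp)
        self.deleted_at = None  # timestamp of the latest row delete, if any

    def insert(self, name, value, timestamp):
        current = self.columns.get(name)
        if current is None or timestamp > current[1]:
            self.columns[name] = (value, timestamp)

    def delete_row(self, timestamp):
        # One marker covers the whole row; no column is named individually.
        if self.deleted_at is None or timestamp > self.deleted_at:
            self.deleted_at = timestamp

    def live_columns(self):
        """Columns that are newer than the row delete (or all, if none)."""
        return {
            name: value
            for name, (value, ts) in self.columns.items()
            if self.deleted_at is None or ts > self.deleted_at
        }


row = Row()
row.insert("a", b"1", 100)
row.insert("b", b"2", 110)
row.delete_row(200)
row.insert("c", b"3", 300)
live = row.live_columns()
```

Columns "a" and "b" are hidden by the delete without being referred to
individually, while "c", written afterwards, remains visible.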

-- 
/ Peter Schuller
