cassandra-user mailing list archives

From Jérôme Verstrynge <jvers...@gmail.com>
Subject Re: What happens if there is a collision?
Date Thu, 21 Oct 2010 23:41:22 GMT
On 21/10/2010 23:40, Peter Schuller wrote:
>> OK. Thanks for your answer. From an email exchange I had with Jonathan, all
>> this means that one should re-read one's writes with quorum to make sure they
>> have not been overridden by timestamp-tie conflicts. I suggested sending
>> feedback to the writing node (in the ACK) when such timestamp-tie conflicts
>> happen. This would avoid having to double-check all writes for timestamp-tie
>> conflicts.
>>
>> If multiple applications write to the same ColumnFamily/Tables, this
>> double-check is a must (unless a separate locking mechanism is implemented,
>> which would be heavier).
> I'm not sure I understand what you're trying to accomplish. Given that
> you have no locking/synchronization mechanism external to Cassandra,
> what is it that you are actually learning from re-reading the value? A
> completed write at level QUORUM means it was successfully written and
> that readers reading at QUORUM will see it unless the value has been
> updated subsequently.
REM: I am not trying to make this discussion longer than necessary or to
play semantics. I am not into that at all and I appreciate the time you
take to answer me, really.

Here is where I disagree with your conclusion when there is a timestamp 
tie. The write by node E will not be performed successfully (at quorum 
level), because of the tie resolution in favor of A somewhere in all the 
nodes between A and E.

Let's imagine that A initiates its column write at: 334450 ms with 'AAA' 
and timestamp 334450 ms
Let's imagine that E initiates its column write at: 334451 ms with
'ZZZ' and timestamp 334450 ms
(E is the latest write)

Let's imagine that A reaches C at 334455 ms and performs its write.
Let's imagine that E reaches C at 334456 ms and attempts to perform its
write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').

Even if there is no further writing on that same column using timestamp
334450, a quorum read won't see that 'ZZZ' value (which is the latest
attempt to write/update the column).

Node A will have completed a write at QUORUM level.
Node E will have completed a write at QUORUM level, but its value won't
be registered and it won't be notified about it.
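
The scenario above can be sketched in a few lines of Python. This is only a toy model of one replica (node C), not Cassandra code: the names Column, Replica and message_tie_rule are hypothetical, and the tie rule simply follows the example in this message, where 'AAA' beats 'ZZZ'. It is not claimed to be Cassandra's actual value comparison.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Column:
    value: str
    timestamp: int  # client-supplied timestamp, in ms


def message_tie_rule(existing: str, incoming: str) -> str:
    # Assumption: follows the example in this message, where 'AAA' beats 'ZZZ'.
    # Not claimed to be Cassandra's actual value comparison.
    return min(existing, incoming)


class Replica:
    """Hypothetical one-column replica with last-write-wins reconciliation."""

    def __init__(self, tie_rule: Callable[[str, str], str]) -> None:
        self.tie_rule = tie_rule
        self.column: Optional[Column] = None

    def write(self, value: str, timestamp: int) -> bool:
        current = self.column
        if current is None or timestamp > current.timestamp:
            self.column = Column(value, timestamp)
        elif timestamp == current.timestamp:
            # Timestamp tie: the value decides, not the arrival order.
            self.column = Column(self.tie_rule(current.value, value), timestamp)
        # An older timestamp changes nothing. The ack is the same in all cases.
        return True


node_c = Replica(message_tie_rule)
ack_a = node_c.write("AAA", 334450)  # A reaches C at 334455 ms
ack_e = node_c.write("ZZZ", 334450)  # E reaches C at 334456 ms
print(ack_a, ack_e, node_c.column.value)  # prints: True True AAA
```

Both writers receive the same acknowledgement, yet only one value survives, which is the situation described for E.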

Hence, I disagree with your conclusion that a quorum write implies that
it was successfully written. It is not the case for E. I know we could
play semantics about the meaning of 'successful write' here, but that
would not lead us anywhere and that is not my point.

> But even if you re-read, that does not remove
> the fundamental potential for a race condition (i.e., you still don't
> know when you see the result of your read whether it wasn't just
> overwritten anyway just after you did your read).
>
> Perhaps I'm misunderstanding what you're trying to do?
I totally agree there is a risk of race condition.

Here is what I am trying to do and why:

If there is no timestamp-tie between A and E, then I have no issue.

If there is a timestamp-tie, then the context becomes uncertain for E,
out of the blue.
If application E can't be sure about what has been saved in Cassandra,
it cannot rely on what it has in memory. It is a vicious circle. It
can't anticipate the potential actions of A on the column either.
This is unusual for any application, but maybe this is the price to pay
for using Cassandra. Fair enough.

If E is not informed of the timestamp tie, then it is left alone in the 
dark. Hence, this is why I say Cassandra is not deterministic to E. The 
result of a write is potentially non-deterministic in what it actually 
performs.

If E was aware that it lost a timestamp-tie, it would know that there is 
a possible gap between its internal memory representation and what it 
tried to save into Cassandra. That is, EVEN if there is no further write 
on that same column (or, in other words, regardless of any potential 
subsequent races).

If E was informed it lost a timestamp-tie, it could re-read the column
(and let's assume that there is no further write in between, but this
does not change the argument). It could spot that its write
for timestamp value 334450 ms failed, and also the reason why ('AAA'
greater than 'ZZZ'). It could issue a new write, which eventually could
result in another timestamp-tie, but at least it would be informed about
it too... It would have a safety net.
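
A rough sketch of that re-read-and-retry loop, as it could look on the application side. TinyStore and write_with_check are hypothetical names standing in for a column read and written at quorum. The tie rule again follows this message's example, and bumping the timestamp on retry is an assumption, since the message does not say how the new write would be stamped.

```python
from typing import Optional, Tuple


class TinyStore:
    """Hypothetical stand-in for one column accessed at quorum."""

    def __init__(self) -> None:
        self.value: Optional[str] = None
        self.timestamp: Optional[int] = None

    def write(self, value: str, timestamp: int) -> None:
        if self.timestamp is None or timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp
        elif timestamp == self.timestamp:
            # Assumption: tie rule from this message's example ('AAA' beats 'ZZZ').
            self.value = min(self.value, value)

    def read(self) -> Tuple[Optional[str], Optional[int]]:
        return self.value, self.timestamp


def write_with_check(store: TinyStore, value: str, timestamp: int,
                     max_attempts: int = 3) -> Optional[int]:
    """Write, re-read, and retry with a later timestamp if the value did not stick.

    Returns the timestamp that was finally stored, or None if every attempt lost.
    """
    for _ in range(max_attempts):
        store.write(value, timestamp)
        seen_value, seen_ts = store.read()
        if seen_value == value:
            return timestamp
        # Assumption: retry just past the newest timestamp observed.
        timestamp = max(timestamp, seen_ts) + 1
    return None


store = TinyStore()
store.write("AAA", 334450)                            # A's write
stored_at = write_with_check(store, "ZZZ", 334450)    # E's write, with check
print(stored_at, store.read())
```

As Peter points out, this check does not remove the race: another writer may still overwrite the column right after the re-read.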

The case I am trying to cover is the case where the context for 
application E becomes invalid because of a successful write call to 
Cassandra without registration of 'ZZZ'. How can Cassandra call it a 
successful write, when in fact, it isn't for application E? I believe 
Cassandra should notify application E one way or another. This is why I 
mentioned an extra timestamp-tie flag in the write ACK sent by nodes 
back to node E.
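
To make the proposal concrete, here is a minimal sketch of what such a flag could look like. Everything in it is hypothetical (WriteAck, lost_tie, FlaggingReplica, quorum_write); Cassandra's write ACK carries no such field, which is precisely the point of the suggestion. The tie rule follows this message's example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WriteAck:
    applied: bool   # the replica processed the mutation
    lost_tie: bool  # hypothetical flag: same timestamp, another value was kept


class FlaggingReplica:
    """Hypothetical replica that reports lost timestamp ties in its ack."""

    def __init__(self) -> None:
        self.value: Optional[str] = None
        self.timestamp: Optional[int] = None

    def write(self, value: str, timestamp: int) -> WriteAck:
        if self.timestamp is None or timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp
            return WriteAck(applied=True, lost_tie=False)
        if timestamp == self.timestamp and value != self.value:
            # Assumption: tie rule from this message's example ('AAA' beats 'ZZZ').
            kept = min(self.value, value)
            self.value = kept
            return WriteAck(applied=True, lost_tie=(kept != value))
        # Older timestamp, or the same value written again: not a lost tie.
        return WriteAck(applied=True, lost_tie=False)


def quorum_write(replicas: List[FlaggingReplica], value: str,
                 timestamp: int, quorum: int) -> Tuple[bool, bool]:
    """Return (completed at quorum, at least one replica reported a lost tie)."""
    acks = [replica.write(value, timestamp) for replica in replicas]
    completed = sum(ack.applied for ack in acks) >= quorum
    lost_tie = any(ack.lost_tie for ack in acks)
    return completed, lost_tie


replicas = [FlaggingReplica() for _ in range(3)]
result_a = quorum_write(replicas, "AAA", 334450, quorum=2)
result_e = quorum_write(replicas, "ZZZ", 334450, quorum=2)
print(result_a, result_e)  # prints: (True, False) (True, True)
```

With the flag, E learns that its write completed but that its value was not kept, and can decide to re-read the column.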

The subsequent question I have is:

If 'value breaks timestamp-tie', how does Cassandra behave in case of
updates? If there is a column with value 'AAA' at 334450 ms and an
application explicitly wants to update this value to 'ZZZ' for 334450
ms, it seems like the timestamp-tie will prevent that. Hence, the
update/mutation would be non-deterministic to E. It seems like one should
first delete the existing record and write a new one (and that could
lead to race conditions and timestamp-ties too).
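
The update question can be explored with a small pure function. This is a sketch of plain last-write-wins reconciliation, not Cassandra's code: Cell and reconcile are hypothetical names, deletes are not modelled, and the tie rule follows this message's example. In this model, an update that reuses timestamp 334450 is decided by the values alone, whatever the arrival order, while an update with a later timestamp replaces the value without a prior delete.

```python
from typing import Tuple

Cell = Tuple[str, int]  # (value, timestamp in ms)


def reconcile(left: Cell, right: Cell) -> Cell:
    """Pick the surviving cell under last-write-wins."""
    if left[1] != right[1]:
        return left if left[1] > right[1] else right
    # Assumption: tie rule from this message's example ('AAA' beats 'ZZZ').
    return left if left[0] <= right[0] else right


stored = ("AAA", 334450)
same_ts_update = reconcile(stored, ("ZZZ", 334450))
later_ts_update = reconcile(stored, ("ZZZ", 334451))
print(same_ts_update, later_ts_update)
```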

My conclusion so far is that a timestamp-tie boolean would help
resolve potentially non-deterministic situations which can appear
randomly at any time. Implementing locks would completely prevent these
situations, but then, locks should be implemented for all writes on all
tables if two application instances have access to them. It is a
light/inexpensive versus heavy/costly safety net situation.

I think this should be documented, because engineers will hit that
'local' non-deterministic issue for sure if two instances of their
applications perform 'completed writes' in the same column family.
Completed does not mean successful, even with quorum (or ALL). They
ought to know it.

Jérôme
