Sender: scode@scode.org
In-Reply-To: <4CC0CFA2.8020901@gmail.com>
References: <4CBF99A8.7060304@dawningstreams.com>
	<AANLkTik1M1tcUGVY8ufAQUVb-pHUvGzNJWQN=uX=yqHF@mail.gmail.com>
	<4CBFB04E.6090406@gmail.com>
	<AANLkTi=C9SDT0LfvFwDQzCV4eGBjp8p95wObwzKpoWrw@mail.gmail.com>
	<4CC08D51.9080405@gmail.com>
	<AANLkTikRkn9gT+CF9VGNPRG+ia2LQXdAiYAmY7nVpN7D@mail.gmail.com>
	<4CC0CFA2.8020901@gmail.com>
Date: Tue, 26 Oct 2010 00:47:52 +0200
Message-ID: <AANLkTi=A=HtTWYqs-74JxtTERMvw1D8mKt+dh2WmGiKh@mail.gmail.com>
Subject: Re: What happens if there is a collision?
From: Peter Schuller <peter.schuller@infidyne.com>
To: Jérôme Verstrynge <jverstry@gmail.com>
Cc: user@cassandra.apache.org
Content-Type: text/plain; charset=UTF-8

(sorry about the delay in responding - inbox backlog)

> REM: I am not trying to make this discussion longer than necessary or to
> play semantics. I am not into that at all and I appreciate the time you
> take to answer me, really.

No problem; and same here. I just think that a mutual understanding
tends to be beneficial both ways ;)

> Here is where I disagree with your conclusion when there is a timestamp tie.
> The write by node E will not be performed successfully (at quorum level),
> because of the tie resolution in favor of A somewhere in all the nodes
> between A and E.
>
> Let's imagine that A initiates its column write at: 334450 ms with 'AAA' and
> timestamp 334450 ms
> Let's imagine that E initiates its column write at: 334451 ms with 'ZZZ' and
> timestamp 334450 ms
> (E is the latest write)
>
> Let's imagine that A reaches C at 334455 ms and performs its write.
> Let's imagine that E reaches C at 334456 ms and attempts to perform its
> write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').
>
> Even if there is no further writing on that same column using timestamp
> 334450, a quorum read won't see that 'ZZZ' value (which is the latest
> attempt to write/update the column).
>
> Node A will have completed a write at QUORUM level.
> Node E will have completed a write at QUORUM level, but its value won't be
> registered and it won't be notified about it.
>
> Hence, I disagree with your conclusion that a quorum write implies that it
> was successfully written. It is not the case for E. I know we could play
> semantics about the meaning of 'successful write' here, but that would not
> lead us anywhere and that is not my point.

It goes to the definition of 'written'. One possible definition of
'written' may be that 'if a value is written, it will be seen by a
subsequent read assuming it was not already re-written'. One example
here, unrelated to Cassandra, is write() in POSIX: if you can prove
that a write() happened (and completed) prior to a read() on the same
file, the read() is guaranteed to see your write(). But this does not
mean that one cannot submit additional writes that will over-write
the data.
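
To make the POSIX analogy concrete, here is a minimal sketch using
Python's thin wrappers around write(), lseek() and read(). The temporary
file and the 'AAA'/'ZZZ' payloads are only illustrative, and a single
process with a single descriptor is assumed, so there is no concurrency
here at all:

```python
import os
import tempfile

# mkstemp() returns a descriptor opened for reading and writing.
fd, path = tempfile.mkstemp()
try:
    # A completed write() is visible to a later read() of the same file.
    os.write(fd, b"AAA")
    os.lseek(fd, 0, os.SEEK_SET)
    first = os.read(fd, 3)

    # Nothing stops a later write() from over-writing that data; a read()
    # after it sees only the newest bytes, not the history.
    os.lseek(fd, 0, os.SEEK_SET)
    os.write(fd, b"ZZZ")
    os.lseek(fd, 0, os.SEEK_SET)
    second = os.read(fd, 3)
finally:
    os.close(fd)
    os.remove(path)
```

Both writes "succeeded", yet only the second is readable afterwards.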

In the case of Cassandra and quorum writes, a similar situation occurs.

Having written a column at QUORUM, you are guaranteed to be able to
read that value back (at QUORUM) at a later time provided that it was
not deleted or over-written in the meantime. None of the sequence
above seems to violate that.
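
The read-back guarantee follows from quorum overlap: any two majorities
of the same replica set share at least one node. A toy simulation of
that, assuming three replicas and simple highest-timestamp-wins
reconciliation; the names (quorum_write, quorum_read, reconcile) are
made up for this sketch and are not Cassandra APIs, and timestamp ties
are deliberately not modelled here:

```python
from itertools import combinations

N = 3                 # replicas holding the column (assumed)
QUORUM = N // 2 + 1   # majority: 2 of 3

def reconcile(cells):
    """Pick the (timestamp, value) cell with the highest timestamp."""
    present = [c for c in cells if c is not None]
    return max(present, key=lambda c: c[0]) if present else None

def quorum_write(replicas, targets, cell):
    """Apply the cell on the chosen replicas if it is newer than what they hold."""
    for i in targets:
        if replicas[i] is None or cell[0] > replicas[i][0]:
            replicas[i] = cell

def quorum_read(replicas, targets):
    """Read from the chosen replicas and reconcile their answers."""
    return reconcile([replicas[i] for i in targets])

replicas = [None] * N
# The write is acknowledged by replicas 0 and 1 only; replica 2 missed it.
quorum_write(replicas, (0, 1), (334450, "AAA"))

# Every possible read quorum overlaps the write quorum in at least one node.
results = [quorum_read(replicas, t) for t in combinations(range(N), QUORUM)]
```

Every entry in results is (334450, 'AAA'), whichever two replicas the
read happens to contact.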

You seem to be after the read seeing your write of 'ZZZ'. But under
what definition of 'written' do you expect this to happen in the face
of concurrent writers? There is never a guarantee that the entire
history of data ever written will be readable in the future; an
overwrite is still an overwrite. Even with something like a local disk
and fsync() in between each write, you have this problem in the
absence of synchronization of readers and writers.

This doesn't mean that your problem is somehow invalid; but it doesn't
sound like (over-writing) writes at QUORUM consistency are the solution.

> Here is what I am trying to do and why:
>
> If there is no timestamp-tie between A and E, then I have no issue.
>
> If there is a timestamp-tie, then the context becomes uncertain for E, out
> of the blue.
> If application E can't be sure about what has been saved in Cassandra, it
> cannot rely on what it has in memory. It is a vicious circle. It can't
> anticipate the potential actions of A on the column either.
> This is unusual for any application, but maybe this is the price to pay for
> using Cassandra. Fair enough.

The problem here is - how would your application *ever* know without
synchronization? The situation should be the same even without a
timestamp tie. In either case, you're writing something to Cassandra
and you know there may be concurrent writers. When the write call
completes, you will never know whether the data you wrote is the
"current" value of the data.

That is, unless you *do* have some form of synchronization which
allows you to guarantee (and know that it is guaranteed) that there is
no timestamp tie, and that your application is informed of other
writes with newer timestamps. But if you have this, it sounds like you
already have a synchronization mechanism?

(Now, Cassandra could support some kind of pub/sub to allow you to be
notified of changes relative to your written data. It doesn't, at the
moment. But I don't think the current behavior is incorrect with
respect to QUORUM consistency.)

> If E is not informed of the timestamp tie, then it is left alone in the
> dark. Hence, this is why I say Cassandra is not deterministic to E. The
> result of a write is potentially non-deterministic in what it actually
> performs.

To rephrase myself a bit: I claim that the result of the write is
non-deterministic in the above sense *anyway*, unless you have a
strictly synchronized concept of monotonically increasing time and the
ability to ascertain the relative order of a write with respect to
other writes in the cluster.

If you *do* have this, then yes, given identical timestamps you have a
problem you would not have otherwise (say, with infinite timestamp
resolution). But if you have this level of synchronization, can you
perhaps guarantee that no two writers ever choose the same timestamp
instead?
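
One way to get such a guarantee, sketched under assumptions: give each
writer a distinct small integer id and fold it into the low-order part
of the timestamp it supplies. The function name, MAX_WRITERS, and the
use of microseconds are all hypothetical choices for this example, and
every writer on the column would have to use the same scheme for the
timestamps to stay comparable:

```python
import time

MAX_WRITERS = 1024  # hypothetical upper bound on distinct writer ids

def unique_timestamp(writer_id, now_us=None):
    """Combine a clock reading with a per-writer id so that two different
    writers can never produce the same timestamp."""
    if not 0 <= writer_id < MAX_WRITERS:
        raise ValueError("writer_id out of range")
    if now_us is None:
        now_us = time.time_ns() // 1000  # wall clock in microseconds
    # The clock reading occupies the high-order part, so ordering by time
    # is preserved; the writer id only separates identical readings.
    return now_us * MAX_WRITERS + writer_id

ts_a = unique_timestamp(0, now_us=334450)  # writer A
ts_e = unique_timestamp(4, now_us=334450)  # writer E, same clock reading
```

ts_a and ts_e differ even though both writers read the same clock value.
This only rules out ties *between* writers; a single writer must still
avoid reusing a clock reading for its own successive writes.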

> If E was aware that it lost a timestamp-tie, it would know that there is a
> possible gap between its internal memory representation and what it tried to
> save into Cassandra. That is, EVEN if there is no further write on that same
> column (or, in other words, regardless of any potential subsequent races).
>
> If E was informed it lost a timestamp-tie, it could re-read the column (and
> let's assume that there is no further write in between, but this does not
> change anything to the argument). It could spot that its write for timestamp
> value 334450 ms failed, and also the reason why ('AAA' greater than 'ZZZ').
> It could operate a new write, which eventually could result in another
> timestamp-tie, but at least it would be informed about it too... It would
> have a safety net.

What is the difference, from your application's perspective, between
the timestamp tie and a write simply happening a millisecond later by
an un-coordinated concurrent writer? In both cases, the data in
Cassandra will no longer match your client's view of it.
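
A small sketch of that comparison, modelling a single column as a
last-write-wins cell. The tie-break used here (the smaller value wins)
is an assumption picked only so that 'AAA' wins as in the scenario
quoted above; it is not a statement about which value Cassandra itself
prefers. apply_write and stored_after are hypothetical helper names:

```python
def apply_write(current, incoming, tie_break=min):
    """Last-write-wins on (timestamp, value); equal timestamps are settled
    by a deterministic comparison of the values."""
    if current is None or incoming[0] > current[0]:
        return incoming
    if incoming[0] == current[0]:
        return (current[0], tie_break(current[1], incoming[1]))
    return current

def stored_after(writes):
    """Fold a sequence of writes, in arrival order, into the stored cell."""
    cell = None
    for w in writes:
        cell = apply_write(cell, w)
    return cell

# Case 1: timestamp tie between A ('AAA') and E ('ZZZ').
tie_case = stored_after([(334450, "AAA"), (334450, "ZZZ")])

# Case 2: no tie; A simply writes one millisecond after E.
later_case = stored_after([(334450, "ZZZ"), (334451, "AAA")])

# What E believes is stored after its own write completed.
e_view = "ZZZ"
```

In both cases the stored value is 'AAA' while E still believes it is
'ZZZ'. From E's side the two situations are indistinguishable.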

> The case I am trying to cover is the case where the context for application
> E becomes invalid because of a successful write call to Cassandra without
> registration of 'ZZZ'. How can Cassandra call it a successful write, when in
> fact, it isn't for application E? I believe Cassandra should notify
> application E one way or another. This is why I mentioned an extra
> timestamp-tie flag in the write ACK sent by nodes back to node E.

I'm repeating myself, but just to be clear: it seems to me such an ACK
would not be useful, since you would not be made aware of any change
that happens later on anyway. It does not seem semantically "relevant"
except perhaps as a probabilistic optimization. As soon as your write
completes, you have no idea what is in Cassandra, regardless of
timestamp ties (assuming you have the potential for concurrent
writers).

> If 'value breaks timestamp-tie', how does Cassandra behave in case of
> updates? If there is a column with value 'AAA' at 334450 ms and an
> application explicitly wants to update this value to 'ZZZ' for 334450 ms,
> it seems like the timestamp-tie will prevent that. Hence, the
> update/mutation would be non-deterministic to E. It seems like one should
> first delete the existing record and write a new one (and that could lead to
> race conditions and timestamp-ties too).

A single client wishing to make multiple logically subsequent writes
should ensure that the same timestamp is not used for such writes.
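
A minimal sketch of how a client could do that, assuming it supplies its
own microsecond timestamps. The class name and the injectable clock are
hypothetical and exist only for this example: if the clock has not
advanced since the previous write, the generator bumps the timestamp by
one instead of reusing it.

```python
import threading
import time

class MonotonicTimestamps:
    """Hand out strictly increasing timestamps within one client process."""

    def __init__(self, clock=None):
        # Default clock: wall time in microseconds. Injectable for testing.
        self._clock = clock or (lambda: time.time_ns() // 1000)
        self._last = 0
        self._lock = threading.Lock()

    def next(self):
        with self._lock:
            now = self._clock()
            # Never go backwards and never repeat the previous value.
            self._last = max(now, self._last + 1)
            return self._last

# With a clock stuck at one reading, successive writes still get
# distinct, increasing timestamps.
gen = MonotonicTimestamps(clock=lambda: 334450)
stamps = [gen.next() for _ in range(3)]
```

This only orders the writes of a single client; it does nothing about
ties with other, un-coordinated writers.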

> I think this should be documented, because engineers will hit that 'local'
> non-deterministic issue for sure if two instances of their applications
> perform 'completed writes' in the same column family. Completed does not
> mean successful, even with quorum (or ALL). They ought to know it.

I think completed does mean successful. I believe the results you are
describing as unexpected are fully expected fundamentally, and there
is no real difference implied in receiving a timestamp ACK flag back.
I'm totally open to being wrong or having misunderstood something (or
both), but right now I don't see it. If on the other hand I'm not
wrong then perhaps we can figure out how to document or present the
functionality of Cassandra better :)

-- 
/ Peter Schuller