Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Reply-To: user@cassandra.apache.org
In-Reply-To: <4CC0CFA2.8020901@gmail.com>
Date: Tue, 26 Oct 2010 00:47:52 +0200
Subject: Re: What happens if there is a collision?
From: Peter Schuller
To: Jérôme Verstrynge
Cc: user@cassandra.apache.org

(sorry about the delay in responding - inbox backlog)

> REM: I am not trying to make this discussion longer than necessary or to
> play semantics. I am not in to that at all and I appreciate the time you
> take to answer me, really.

No problem; and same here. I just think that a mutual understanding
tends to be beneficial both ways ;)

> Here is where I disagree with your conclusion when there is a timestamp tie.
> The write by node E will not be performed successfully (at quorum level),
> because of the tie resolution in favor of A somewhere in all the nodes
> between A and E.
>
> Let's imagine that A initiates its column write at: 334450 ms with 'AAA' and
> timestamp 334450 ms
> Let's imagine that E initiates its column write at: 334451 ms with 'ZZZ' and
> timestamp 334450 ms
> (E is the latest write)
>
> Let's imagine that A reaches C at 334455 ms and performs its write.
> Let's imagine that E reaches C at 334456 ms and attempts to perform its
> write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').
>
> Even if there is no further writing on that same column using timestamp
> 334450, a quorum read won't see that 'ZZZ' value (which is the latest
> attempt to write/update the column).
>
> Node A will have completed a write at QUORUM level.
> Node E will have completed a write at QUORUM level, but its value won't be
> registered and it won't be notified about it.
>
> Hence, I disagree with your conclusion that a quorum write implies that it
> was successfully written. It is not the case for E. I know we could play
> semantics about the meaning of 'successful write' here, but that would not
> lead us nowhere and that is not my point.

It goes to the definition of 'written'.
One possible definition of 'written' may be that 'if a value is
written, it will be seen by a subsequent read assuming it was not
already re-written'.

One example here unrelated to Cassandra is a write() in POSIX; if you
can prove a write() happened (and completed) prior to a read() on the
same file, say, you are supposed to be guaranteed that the read() will
see your write(). But this does not mean that one cannot submit
additional writes that will over-write the data.

In the case of Cassandra and quorum writes, a similar situation
occurs. Having written a column at QUORUM, you are guaranteed to be
able to read that value back (at QUORUM) at a later time provided that
it was not deleted or over-written in the meantime. None of the
sequence above seems to violate that.

You seem to be after the read seeing your write of 'ZZZ'. But under
what definition of 'written' do you expect this to happen in the face
of concurrent writers? There is never a guarantee that the entire
history of data ever written will be readable in the future; an
overwrite is still an overwrite. Even with something like a local disk
and fsync() in between each write, you have this problem in the
absence of synchronization of readers and writers.

This doesn't mean that your problem is somehow invalid; but it doesn't
sound like (over-writing) writes at QUORUM consistency are the
solution.

> Here is what I am trying to do and why:
>
> If there is no timestamp-tie between A and E, then I have no issue.
>
> If there is a timestamp-tie, then the context becomes uncertain for E, out
> of the blue.
> If application E can't be sure about what has been saved in Cassandra, it
> cannot rely on what it has in memory. It is a vicious circle. It can't
> anticipate on the potential actions of A on the column too.
> This is unusual for any application, but may be this is the price to pay for
> using Cassandra. Fair enough.

The problem here is: how would your application *ever* know without
synchronization?
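
To make the question concrete, here is a minimal sketch in Python. It
is an illustration only, not Cassandra code: the names Column,
reconcile and replica_state are made up for this example, and the
tie-break direction ('AAA' beats 'ZZZ') is simply taken from the
example quoted above, so treat it as an assumption. All the sketch
relies on is that every replica applies the same deterministic rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    value: str
    timestamp: int  # client-supplied timestamp, in milliseconds


def reconcile(a: Column, b: Column) -> Column:
    """Return the surviving version of a column (hypothetical helper)."""
    if a.timestamp != b.timestamp:
        # The newer timestamp always wins.
        return a if a.timestamp > b.timestamp else b
    # Timestamp tie: break it by value. Assumption: the direction follows
    # the quoted example, where 'AAA' wins over 'ZZZ'.
    return a if a.value <= b.value else b


def replica_state(writes):
    """Fold writes, in arrival order, into what a replica ends up holding."""
    current = writes[0]
    for write in writes[1:]:
        current = reconcile(current, write)
    return current


e_write = Column("ZZZ", 334450)      # E's write
tie_write = Column("AAA", 334450)    # A's write, same timestamp
later_write = Column("BBB", 334451)  # uncoordinated writer, 1 ms later

after_tie = replica_state([tie_write, e_write])
after_later = replica_state([e_write, later_write])

# In both cases a later read does not return what E wrote.
print(after_tie.value, after_later.value)  # prints: AAA BBB

# The arrival order at the replica does not change the outcome.
assert replica_state([e_write, tie_write]) == after_tie
```

From E's side the two outcomes look the same: the write call
completed, and the stored value is no longer 'ZZZ'.
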
The situation should be the same even without a timestamp tie. In
either case, you're writing something to Cassandra and you know there
may be concurrent writers. When the write call completes, you will
never know whether the data you wrote is the "current" value of the
data.

That is, unless you *do* have some form of synchronization which
allows you to guarantee (and know that it is guaranteed) that there is
no timestamp tie, and that your application is informed of other
writes with newer timestamps. But if you have this, it sounds like you
already have a synchronization mechanism?

(Now, Cassandra could support some kind of pub/sub to allow you to be
notified of changes relative to your written data. It doesn't, at the
moment. But I don't think the current behavior is incorrect with
respect to QUORUM consistency.)

> If E is not informed of the timestamp tie, then it is left alone in the
> dark. Hence, this is why I say Cassandra is not deterministic to E. The
> result of a write is potentially non-deterministic in what it actually
> performs.

To re-phrase myself a bit: I claim that the result of the write is
non-deterministic in the above sense *anyway*, unless you have a
strictly synchronized concept of monotonically increasing time and the
ability to ascertain the relative order of a write with respect to
other writes in the cluster.

If you *do* have this, then yes, given identical timestamps you have a
problem you would not have otherwise (say with infinite resolution
time). But if you have this level of synchronization, can you perhaps
guarantee that no two writers ever choose the same timestamp instead?

> If E was aware that it lost a timestamp-tie, it would know that there is a
> possible gap between its internal memory representation and what it tried to
> save into Cassandra. That is, EVEN if there is no further write on that same
> column (or, in other words, regardless of any potential subsequent races).
>
> If E was informed it lost a timestamp-tie, it could re-read the column (and
> let's assume that there is no further write in between, but this does not
> change anything to the argument). It could spot that its write for timestamp
> value 334450 ms failed, and also the reason why ('AAA' greater than 'ZZZ').
> It could operate a new write, which eventually could result in another
> timestamp-tie, but at least it would be informed about it too... It would
> have a safety net.

What is the difference, from your application's perspective, between
the timestamp tie and a write simply happening a millisecond later by
an un-coordinated concurrent writer? In both cases, the data in
Cassandra will no longer match your client's view of it.

> The case I am trying to cover is the case where the context for application
> E becomes invalid because of a successful write call to Cassandra without
> registration of 'ZZZ'. How can Cassandra call it a successful write, when in
> fact, it isn't for application E? I believe Cassandra should notify
> application E one way or another. This is why I mentioned an extra
> timestamp-tie flag in the write ACK sent by nodes back to node E.

I'm repeating myself, but just to be clear: it seems to me such an ACK
would not be useful, since you would not be made aware of any change
that happens later on anyway. It does not seem semantically "relevant"
except perhaps as a probabilistic optimization. As soon as your write
completes, you have no idea what is in Cassandra, regardless of
timestamp ties (assuming you have the potential for concurrent
writers).

> If 'value breaks timestamp-tie', how does Cassandra behave in case of
> updates? If there is a column with value 'AAA' at 334450 ms and an
> application explicitly wants to update this value to 'ZZZ' for 334450 ms,
> it seems like the timestamp-tie will prevent that. Hence, the
> update/mutation would be undeterministic to E.
> It seems like one should
> first delete the existing record and write a new one (and that could lead to
> race conditions and timestamp-ties too).

A single client wishing to make multiple logically subsequent writes
should ensure that the same timestamp is not used for such writes.

> I think this should be documented, because engineers will hit that 'local'
> undeterministic issue for sure if two instances of their applications
> perform 'completed writes' in the same column family. Completed does not
> mean successful, even with quorum (or ALL). They ought to know it.

I think it does. I believe the results you are describing as
unexpected are, fundamentally, fully expected, and that receiving a
timestamp-tie flag back in the ACK would make no real difference.

I'm totally open to being wrong or having misunderstood something (or
both), but right now I don't see it. If on the other hand I'm not
wrong then perhaps we can figure out how to document or present the
functionality of Cassandra better :)

--
/ Peter Schuller