Date: Fri, 6 Jul 2012 10:28:35 +0000 (UTC)
From: "Sylvain Lebresne (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-4285) Atomic, eventually-consistent batches

    [ https://issues.apache.org/jira/browse/CASSANDRA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407881#comment-13407881 ]

Sylvain Lebresne commented on CASSANDRA-4285:
---------------------------------------------

bq. Well, it's more complex than that.

I understand that, but:
* with RF=1 we would still write to disk on only 1 node. So if some disk in the cluster has any problem, it's enough to have one other node go down (it doesn't have to be another hardware failure, it could be a simple OOM or anything really) to break the atomicity "guarantee". Granted, you have to be a bit unlucky, but the odds are far from unimaginable imo. And that's what guarantees are about: protecting you against being unlucky. I think RF=2 makes this orders of magnitude more secure, and if RF=2 had big drawbacks, then ok, why not consider RF=1 as the default, but I don't think that's the case, quite the contrary even.
* as said in my previous comment, it's not only about durability. It's a latency issue. If you do RF=1, then each time a node dies (is upgraded, or whatnot) you *know* that some portion of batchlog writes on the cluster will suffer from timeouts (even if we retry on the coordinator, the latency will still suffer). That's actually the main reason why I think RF=2 is a much, much better default.
* I don't see much downside to RF=2 compared to RF=1. A little more network traffic and CPU usage maybe, but I think those are largely outweighed by the advantages.

Overall I do think quite strongly that RF=1 is the wrong default (and having it configurable doesn't make it a better default).
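To put a rough number on the "orders of magnitude" point in the first bullet: if each node is independently unavailable with some small probability p during the window that matters, losing the only batchlog copy (RF=1) happens with probability p, while losing both copies (RF=2) takes two simultaneous failures, i.e. p^2. The snippet below is only a back-of-the-envelope illustration of that argument, not code from the patch, and the failure probability is an assumed input.

{code:java}
public class BatchlogOddsSketch {
    public static void main(String[] args) {
        // Assumed probability that any single node is down or unreachable
        // during the relevant window; purely illustrative.
        double p = 0.001;

        double rf1LossChance = p;     // the single batchlog copy is gone
        double rf2LossChance = p * p; // both copies must be gone at once

        System.out.printf("RF=1: batchlog lost with probability %.6f%n", rf1LossChance);
        System.out.printf("RF=2: batchlog lost with probability %.8f%n", rf2LossChance);
        // With p = 0.001 this is 1e-3 vs 1e-6, i.e. three orders of magnitude.
    }
}
{code}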
bq. "Failed writes leave the cluster in an unknown state" is the most frequent [legitimate] complaint users have about Cassandra, and one that affects evaluations vs master-oriented systems. We can try to educate about the difference between UE failure and TOE not-really-failure until we are blue in the face but we will continue to get hammered for it.

Let's be clear that I completely agree with that. But "failed writes leave the cluster in an unknown state" is fixed by fixing atomicity. And I'm all for fixing batch atomicity, and I even think that for CQL3 we should make batches atomic by default for all the reasons you mentioned (which wouldn't exclude having some escape hatch like "BATCH ... APPLY WITHOUT ATOMICITY GUARANTEE"). But whether we do coordinator-side retry is not directly related imho (and so at best should be considered separately).

To be precise, the DCL patch will add one more possibility for TOE compared to the current write path, and that's a TOE while writing into the DCL. First, I think that using RF=2 will largely mitigate the chance of getting that TOE in the first place, as said above. That being said, we could indeed retry another shard if we do still get a TOE, I suppose. The only thing that bothers me a bit is that I think it's useful for the timeout configured by the client to be an actual timeout on the server answer, even if only to say that we haven't achieved what was asked in the time granted (and again, I'm all for returning more information on what a TOE means exactly, i.e. CASSANDRA-4414, so that clients may be able to judge whether what we have been able to achieve during that time is enough that they don't need to retry). However, I suppose one option could be to try the DCL write with a smaller timeout than the client-supplied one, so that we can do a retry while still respecting the client timeout.

bq. Finally, once the batchlog write succeeds, we shouldn't have to make the client retry for timeouts writing to the replicas either; we can do the retry server-side

My point is that retrying server-side in that case would be plain wrong. On the write path (that's not true for reads, but that is a different subject), a timeout when writing to the replicas means that the CL *cannot* be achieved at the current time (counters are another exception to that, but they are a whole different problem). So retrying (client or server side, for that matter) with the same CL is useless and bad. The only thing that can be improved compared to today is that we can tell the client that while the CL could not be achieved, we did persist the write on some replica, which would remove the retry-with-smaller-CL-because-even-if-I-can't-get-my-CL-I-want-to-make-sure-the-write-is-at-least-persisted-on-some-replicas most clients probably do today. And that is really useful, but it is also a totally separate issue from this ticket (namely CASSANDRA-4414) that doesn't only apply to batches, nor only to the atomic ones.

As a side note, I wouldn't be completely against discussing the possibility of doing some coordinator-side retry for reads, but that's a different issue :)
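One way to picture the "smaller timeout" idea above: carve the client-supplied timeout into a first attempt against one batchlog shard and, if that attempt times out, a second attempt against a different shard, so the coordinator still answers within the original budget. The sketch below is only an illustration of that budget split; BatchlogShard and its write method are hypothetical stand-ins, not APIs from the patch or from Cassandra.

{code:java}
import java.util.List;
import java.util.concurrent.TimeoutException;

public class BatchlogRetrySketch {
    /** Hypothetical stand-in for writing the batch to one batchlog shard. */
    interface BatchlogShard {
        void write(byte[] serializedBatch, long timeoutMillis) throws TimeoutException;
    }

    /**
     * Try the batchlog write against up to two shards while staying inside the
     * timeout the client asked for: the first attempt gets half the budget so
     * that a retry is still possible within the remaining time.
     */
    static void writeWithBudget(List<BatchlogShard> shards, byte[] batch, long clientTimeoutMillis)
            throws TimeoutException {
        long deadline = System.currentTimeMillis() + clientTimeoutMillis;
        long firstAttemptBudget = clientTimeoutMillis / 2;

        try {
            shards.get(0).write(batch, firstAttemptBudget);
            return;
        } catch (TimeoutException e) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0 || shards.size() < 2) {
                // No budget (or no other shard) left: surface the timeout to the client.
                throw e;
            }
            // Retry on a different shard with whatever time is left in the client's budget.
            shards.get(1).write(batch, remaining);
        }
    }
}
{code}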
> Atomic, eventually-consistent batches
> -------------------------------------
>
>                 Key: CASSANDRA-4285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4285
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>
> I discussed this in the context of triggers (CASSANDRA-1311) but it's useful as a standalone feature as well.