cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4285) Atomic, eventually-consistent batches
Date Fri, 06 Jul 2012 10:28:35 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407881#comment-13407881 ]

Sylvain Lebresne commented on CASSANDRA-4285:
---------------------------------------------

bq. Well, it's more complex than that.

I understand that, but:
* with RF=1 we would still write to disk on only 1 node. So if some disk in the cluster has
any problem, it's enough to have one other node go down (it doesn't have to be another hardware
failure, it could be a simple OOM or anything really) to break the atomicity "guarantee".
Granted, you have to be a bit unlucky, but the odds are far from unimaginable imo, and that's
what guarantees are about: protecting you against being unlucky. I think RF=2 makes this orders
of magnitude more secure, and if RF=2 had big drawbacks, then ok, why not consider RF=1 as
the default, but I don't think that's the case, quite the contrary even.
* as I said in my previous comment, it's not only about durability, it's a latency issue. If
you do RF=1, then each time a node dies (or is upgraded or whatnot) you *know* that some portion
of batchlog writes on the cluster will suffer from timeouts (even if we retry on the coordinator,
the latency will still suffer). That's actually the main reason why I think RF=2 is a much
much better default.
* I don't see much downside to RF=2 compared to RF=1. A little bit more network traffic and
more CPU usage maybe, but I think those are largely outweighed by the advantages (see the
sketch right after this list).
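
To make the RF=2 point concrete, here is a minimal, hypothetical sketch of what picking two
batchlog nodes on the coordinator could look like. The class name, method names and the
selection policy are illustrative assumptions, not the patch's actual BatchlogManager code.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: with two batchlog copies, losing any single node
// (disk failure, OOM, upgrade) neither loses the batchlog entry nor forces
// the write to time out. Names and the selection policy are illustrative.
public class BatchlogEndpointSelection
{
    static List<String> pickBatchlogEndpoints(List<String> liveNodes, String coordinator, int copies)
    {
        List<String> candidates = new ArrayList<String>(liveNodes);
        candidates.remove(coordinator);      // prefer other nodes for durability
        Collections.shuffle(candidates);     // spread the batchlog load around the cluster
        if (candidates.size() < copies)
            candidates.add(coordinator);     // degraded cluster: fall back to ourselves
        return candidates.subList(0, Math.min(copies, candidates.size()));
    }

    public static void main(String[] args)
    {
        List<String> live = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4");
        // e.g. prints two distinct nodes other than the coordinator 10.0.0.1
        System.out.println(pickBatchlogEndpoints(live, "10.0.0.1", 2));
    }
}
{code}

With two copies, the batchlog entry survives the loss of either node, which is the whole point
of making RF=2 the default.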

Overall I do think quite strongly that RF=1 is the wrong default (and having it configurable
doesn't make it a better default). 

bq. "Failed writes leave the cluster in an unknown state" is the most frequent [legitimate]
complaint users have about Cassandra, and one that affects evaluations vs master-oriented
systems. We can try to educate about the difference between UE failure and TOE not-really-failure
until we are blue in the face but we will continue to get hammered for it.

Let's be clear that I completely agree with that. But "Failed writes leave the cluster in an
unknown state" is fixed by fixing atomicity. And I'm all for fixing batch atomicity; I even
think that for CQL3 we should make batches atomic by default for all the reasons you mentioned
(which wouldn't exclude having some escape hatch like "BATCH ... APPLY WITHOUT ATOMICITY
GUARANTEE"). But whether we do coordinator-side retry is not directly related imho (and so
at best should be considered separately).

To be precise, the DCL patch will add one more possibility for TOE compared to the current
write path, and that's a TOE while writing into the DCL. First, I think that using RF=2 will
largely mitigate the chance of getting that TOE in the first place, as said above. That being
said, we could indeed retry another shard if we do still get a TOE, I suppose. The only thing
that bothers me a bit is that I think it's useful for the timeout configured by the client
to be an actual timeout on the server answer, even if only to say that we haven't achieved
what was asked in the time granted (and again, I'm all for returning more information on what
a TOE means exactly, i.e. CASSANDRA-4414, so that the client may be able to judge whether
what we did achieve during that time is enough that it doesn't need to retry). However, I
suppose one option could be to try the DCL write with a smaller timeout than the client-supplied
one, so that we can do a retry while respecting the client timeout.
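
A minimal sketch of that last idea, assuming the coordinator gives the first DCL write only
part of the client's timeout budget and spends whatever is left on a retry against a different
node; the 50/50 split and the writeWithRetry/Attempt names are made up for illustration, not
taken from the patch.

{code}
import java.util.concurrent.TimeoutException;

// Illustrative only: split the client's timeout across two batchlog attempts
// so a retry can still complete within the deadline the client asked for.
public class BatchlogRetryWithinTimeout
{
    interface Attempt { boolean run(long timeoutMillis) throws TimeoutException; }

    static boolean writeWithRetry(Attempt first, Attempt fallback, long clientTimeoutMillis)
            throws TimeoutException
    {
        long start = System.currentTimeMillis();
        long firstBudget = clientTimeoutMillis / 2;           // keep room for one retry (assumed split)
        try
        {
            return first.run(firstBudget);
        }
        catch (TimeoutException e)
        {
            long remaining = clientTimeoutMillis - (System.currentTimeMillis() - start);
            if (remaining <= 0)
                throw e;                                      // budget exhausted, surface the TOE
            return fallback.run(remaining);                   // try another batchlog node
        }
    }

    public static void main(String[] args) throws TimeoutException
    {
        boolean ok = writeWithRetry(
            t -> { throw new TimeoutException("first batchlog node too slow"); },  // simulated TOE
            t -> true,                                                             // second node answers in time
            10_000);
        System.out.println("batchlog write succeeded: " + ok);
    }
}
{code}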

bq. Finally, once the batchlog write succeeds, we shouldn't have to make the client retry
for timeouts writing to the replicas either; we can do the retry server-side

My point is that retrying server-side in that case would be plain wrong. On the write path
(that's not true for reads, but that's a different subject), a timeout when writing to the
replicas means that the CL *cannot* be achieved at the current time (counters are another
exception to that, but they are a whole different problem). So retrying (client- or server-side
for that matter) with the same CL is useless and bad. The only thing that can be improved
compared to today is that we can tell the client that while the CL cannot be achieved, we
did persist the write on some replica, which would remove the
retry-with-smaller-CL-because-even-if-I-can't-get-my-CL-I-want-to-make-sure-the-write-is-at-least-persisted-on-some-replicas
most clients probably do today. And that is really useful, but it is also a totally separate
issue from this ticket (namely CASSANDRA-4414) that doesn't only apply to batches, nor only
to the atomic ones.
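
For illustration, a rough sketch of the client-side pattern described above, with a made-up
Session interface standing in for a real driver: on a write timeout at QUORUM, the client
falls back to CL.ONE only to confirm the write reached some replica, which is exactly the
round trip that richer TOE information (CASSANDRA-4414) would make unnecessary.

{code}
import java.util.concurrent.TimeoutException;

// Hedged illustration of the downgrading-retry pattern clients use today.
// The Session interface is a stand-in, not a real driver API.
public class DowngradingRetryExample
{
    enum ConsistencyLevel { ONE, QUORUM }

    interface Session { void write(String query, ConsistencyLevel cl) throws TimeoutException; }

    static void writeWithDowngrade(Session session, String query) throws TimeoutException
    {
        try
        {
            session.write(query, ConsistencyLevel.QUORUM);
        }
        catch (TimeoutException e)
        {
            // QUORUM cannot be achieved right now, so retrying at the same CL is pointless;
            // clients drop to ONE just to learn whether the write is persisted at all.
            session.write(query, ConsistencyLevel.ONE);
        }
    }

    public static void main(String[] args) throws TimeoutException
    {
        // Simulated session: QUORUM times out, ONE succeeds.
        Session session = (query, cl) ->
        {
            if (cl == ConsistencyLevel.QUORUM)
                throw new TimeoutException("only one replica acknowledged");
            System.out.println("persisted at CL." + cl + ": " + query);
        };
        writeWithDowngrade(session, "INSERT INTO t (k, v) VALUES (0, 0)");
    }
}
{code}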

As a side note, I wouldn't be completely against discussing the possibility of doing some
coordinator-side retry for reads, but that's a different issue :)
                
> Atomic, eventually-consistent batches
> -------------------------------------
>
>                 Key: CASSANDRA-4285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4285
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>
> I discussed this in the context of triggers (CASSANDRA-1311) but it's useful as a standalone
> feature as well.

