cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8984) Introduce Transactional API for behaviours that can corrupt system state
Date Mon, 23 Mar 2015 22:43:54 GMT


Benedict commented on CASSANDRA-8984:

There is unfortunately a lot of boilerplate, because all but one class that implements this
behaviour already has a superclass - otherwise i'd have just provided an abstract class and
we could have avoided all of the code duplication. I thought it was better to have an internal
state class that is shared by all, though, so that the semantics are definitely the same for
each object - copy/pasting boilerplate seems much less of a problem to me than duplicating
even relatively simple internal state management. The idea of this interface is that multiple
such implementors can safely be composed without understanding exactly what each of them is
or does. This means we can just ensure every class follows the strict guidelines of the interface,
and every change to a class that uses it can stop worrying about all of the complex state
interactions with each of the related classes, and just ensure they use the interface correctly,
and implement it correctly themselves. In this sense, having the no-op transitions and boilerplate
that just ensure correct usage are IMO very useful. 

bq. I'm not too keen on prepareToCommit being part of the usage pattern without actually being
present in the interface but I can see why you went that route. I don't have any alternative
suggestions unfortunately.

I started with this part of the interface, but it really didn't work. I'm open to suggestions,
but currently this is the best setup AFAICT.

bq. The passing around of Throwables to merge deep in the stack (Transactional, SafeMemory*,
Ref* etc) is a little cludgy and has some pretty deeply nested tag-along variables. Not sure
why we can't just return a Throwable up that stack and merge at the top level when we know
we might need to merge rather than passing the Throwable all the way down from the top to
merge at the bottom...?

This is something I could live with either way, but the thing I like about this approach is
it makes it very clear that the correct usage requires ensuring no Throwables are thrown,
nor are they not returned (you pass one in, and get one back). It makes it obvious to a user
of the API that, if they're supposed to pass one in, they must have one floating around. If
we simply return a Throwable, I think it will lead to more incidences of incorrect usage.
I also think it leads to neater implementations, as at the most difficult parts (i.e. where
we're stitching multiple different transactional behaviours together) we're only making declarative
statements, and not littering them with lots of merge() statements - these only ever occur
inside the catch clauses we already have to have. TL; DR: IMO, it's both easier to read, and
harder to misuse.

bq. As we're not attempting to address any current pain-point with this ticket

My concern is that we don't (nor have we ever) deal with rollback safely. I can prove we don't
very simply with CASSANDRA-8568, which actually partially spawned this parallel but more constrained
patch: I fault injection tested small portions of these parts of the code, and they failed
(but the bits I'd hardened in -8568 did not). So if exceptions that can occur under normal
operation do so, we will result in inconsistent internal state (corrupted data tracker being
the worst possible outcome, but more mundane things like dangling files are a bigger problem).
Much of the problems we've seen have been down to this unwinding of state. That said, I completely
have sympathy with the view this is too much for 2.1. I won't push hard for its inclusion,
as it may not be necessary. But we _are_ definitely broken in 2.1 for dealing with failures
in the middle of state changes, especially if those failures happen during _rollback_ of a
change, and it's not clear what the system state will be after this. We've had some programmer
induced pain recently which increased the incidence of these problems. Hopefully those have
gone, but the risk is there either way.

bq. Moving the state tracking logic from within the SSTRW and SSTW into their own abstraction

As far as this patch is concerned the only new state object is the Transactional.StateManager,
which is just a tool for making boilerplate easier to spot, and easier to ensure are all behaving
in the same way. Other than this all of the state is still managed by the same objects? I
agree CASSANDRA-8568 does this, but this is really about pulling a hundred different modifying
points and controlling them in one place, so that again the semantics are clear.

bq. is increasing our net complexity 

On the whole my current raft of changes is all about trying to make semantics clear and well
defined, which in my opinion reduces complexity, significantly, even if it increases LOC.
The idea is to reduce the amount of special casing, and of subtly differing behaviours between
similar classes. Here we had a half dozen classes each doing the same basic thing, but each
doing it in their own different way, and we would get problems with code complexity and mistakes
when any two interfaced in a way that was discordant. I'm quite convinced this approach is
_simpler_ than what it replaces, and definitely less error prone. The main difference is that
it was broken before, and we didn't care so much, so the real complexity was disguised by
a perceived simplicity. I'm hoping this reverses that, although if we can get perceived simplicity
as well, that would be great.

I don't pretend it's definitely the best design it can be for consumption though, so hopefully
we can work towards making it more obvious so that any outwardly perceived complexity is mitigated.

I will take a look at the timeouts tomorrow - thanks.

> Introduce Transactional API for behaviours that can corrupt system state
> ------------------------------------------------------------------------
>                 Key: CASSANDRA-8984
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>             Fix For: 2.1.4
>         Attachments: 8984_windows_timeout.txt
> As a penultimate (and probably final for 2.1, if we agree to introduce it there) round
of changes to the internals managing sstable writing, I've introduced a new API called "Transactional"
that I hope will make it much easier to write correct behaviour. As things stand we conflate
a lot of behaviours into methods like "close" - the recent changes unpicked some of these,
but didn't go far enough. My proposal here introduces an interface designed to support four
actions (on top of their normal function):
> * prepareToCommit
> * commit
> * abort
> * cleanup
> In normal operation, once we have finished constructing a state change we call prepareToCommit;
once all such state changes are prepared, we call commit. If at any point everything fails,
abort is called. In _either_ case, cleanup is called at the very last.
> These transactional objects are all AutoCloseable, with the behaviour being to rollback
any changes unless commit has completed successfully.
> The changes are actually less invasive than it might sound, since we did recently introduce
abort in some places, as well as have commit like methods. This simply formalises the behaviour,
and makes it consistent between all objects that interact in this way. Much of the code change
is boilerplate, such as moving an object into a try-declaration, although the change is still
non-trivial. What it _does_ do is eliminate a _lot_ of special casing that we have had since
2.1 was released. The data tracker API changes and compaction leftover cleanups should finish
the job with making this much easier to reason about, but this change I think is worthwhile
considering for 2.1, since we've just overhauled this entire area (and not released these
changes), and this change is essentially just the finishing touches, so the risk is minimal
and the potential gains reasonably significant.

This message was sent by Atlassian JIRA

View raw message