cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6477) Materialized Views (was: Global Indexes)
Date Tue, 21 Jul 2015 10:47:09 GMT


Benedict commented on CASSANDRA-6477:

So, I've been mulling on this, and I think we can safely guarantee eventual consistency even
without a write-path repair. The question is only timeliness; however, this can be problematic
with or without write-path repair, since DELETEs and UPDATEs for the same operation are not
replicated in a coordinated fashion.

h3. Timeliness / Consistency Caveats
If, for instance, three different updates are sent at once, and each base-replica receives
a different update, each may be propagated onto two different nodes via repair, giving 9 nodes
with live data. Eventually one more base-replica receives each of the original updates, and
two issue deletes for their data. So we have 7 nodes with live data and 2 with tombstones. The
batchlog is now happy; however, the system remains in an inconsistent state until either the MV
replicas are repaired again or - with write-path repair - the base-replicas are.
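
To make the node counting above concrete, here is a toy tally (a sketch in Python; the pairing of each base replica to a single view replica for deletes, and all of the names, are my assumptions about the scenario rather than anything in the patch):

```python
# Toy count of the scenario above, under stated assumptions:
# RF=3 for both base and view; each distinct value lands on its own
# set of 3 view replicas after repair; a base replica's delete
# initially reaches only its one paired view replica.

# Step 1: each base replica holds one of the three updates; repair
# spreads each value to all 3 of its view replicas -> 9 live nodes.
live = {f"{u}-replica{i}" for u in ("u1", "u2", "u3") for i in range(3)}
tombstones = set()
assert (len(live), len(tombstones)) == (9, 0)

# Step 2: each base replica later receives one of the other updates.
# Two of them see a newer timestamp, adopt it, and issue a delete for
# their old value - but that delete reaches only the paired replica.
for superseded in ("u1", "u2"):
    victim = f"{superseded}-replica0"   # only the paired view replica
    live.discard(victim)
    tombstones.add(victim)
assert (len(live), len(tombstones)) == (7, 2)
```

The other two replicas of each superseded value stay live until the next MV repair, which is the inconsistency window described above.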

Without write-path repair we may be more prone to this problem, but I don't think dramatically
more, although I haven't the time to think it through exhaustively. AFAICT, repair must necessarily
be able to introduce inconsistency that can only be resolved by another repair (which itself
can, of course, introduce more inconsistency).

I'm pretty sure there is a multiplicity of similar scenarios, and certainly there are less
extreme ones. Two competing updates and one repair are enough, so long as it's the "wrong"
update that is repaired, and the "wrong" MV replica it's repaired to.

h3. Correctness Caveats
There are also other problems unique to MV: if we lose any two base-replicas (which with vnodes
means any two nodes), we can be left with ghost records that are *never* purged. So any concurrent
loss of two nodes means we really need to truncate the MV and rebuild, or we need a special
truncation record to truncate only the portion that was owned by those two nodes. This happens
for any updates that were received only by those two nodes, but were then proxied on (or written
to their batchlogs). This can of course affect any normal QUORUM updates to the cluster, the
difference being that the user has no control over these, and simply resending your update
to the cluster does not resolve the problem as it would in the current world. Users performing
updates to single partitions that would have never been affected by this now also have this
to worry about.
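
A toy illustration of the ghost-record hazard (a hypothetical model, not the patch's code): an update reaches only two base replicas, which propagate it to the view before both are lost; no surviving base replica ever learns the value, so nothing can generate the matching view delete.

```python
# Base RF=3: the write reached replicas 0 and 1 only, and they
# propagated it to the view before failing.
base = {0: ("v", 1), 1: ("v", 1), 2: None}   # replica 2 never saw the write
view = {"v"}                                  # propagated by replicas 0 and 1

for lost in (0, 1):                           # both knowing replicas fail
    base[lost] = None

survivors = [x for x in base.values() if x is not None]
# The view row survives with no base data anywhere to shadow or delete it.
assert survivors == [] and view == {"v"}
```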

h3. Confidence
Certainly, I think we need to do a *bit* of formal analysis or simulation of what the possible
cluster states are. Ideally a simple model of how each piece of infrastructure works could
be constructed in a single process to run the equivalent of years of operations a normal cluster
would execute, to explore the levels of "badness" we can expect. That's just my opinion, but
I think it would be invaluable, because after spending some spare time thinking about these
problems, I think it is a _very hard thing to do_, and I would rather not trust our feelings
about correctness.
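
As a starting point, such a single-process model could look something like this (purely a hypothetical sketch: the replica count, last-write-wins resolution, and the "divergent steps" metric are my assumptions, and a real harness would also model repair, deletes, and node loss):

```python
import random

def run_trial(rng, updates):
    """Deliver every update to every replica in a random order and
    count how many delivery steps leave the replicas disagreeing."""
    replicas = [{} for _ in range(3)]            # key -> (value, timestamp)
    deliveries = [(r, u) for r in range(3) for u in updates]
    rng.shuffle(deliveries)
    divergent_steps = 0
    for replica_idx, (key, value, ts) in deliveries:
        current = replicas[replica_idx].get(key)
        if current is None or ts > current[1]:   # last-write-wins
            replicas[replica_idx][key] = (value, ts)
        if any(r != replicas[0] for r in replicas[1:]):
            divergent_steps += 1
    # Eventual convergence: LWW is order-independent once all arrive.
    assert all(r == replicas[0] for r in replicas[1:])
    return divergent_steps

rng = random.Random(6477)
updates = [("k", "a", 1), ("k", "b", 2), ("k", "c", 3)]
steps = [run_trial(rng, list(updates)) for _ in range(1000)]
```

Running many cheap trials like this is the "years of operations in one process" idea: the distribution of `steps` is a crude measure of how long the cluster spends in divergent states.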

h3. Multiple Columns
As far as multiple columns are concerned: I think we may need to go back to the drawing board
there. It's actually really easy to demonstrate the cluster getting into broken states. Say
you have three columns, A B C, and you send three competing updates a b c to their respective
columns; previously all held the value _. If they arrive in different orders on each base-replica
we can end up with 6 different MV states around the cluster. If any base replica dies, you
don't know which of those 6 intermediate states were taken (and probably replicated) by its
MV replicas. This problem grows exponentially as you add "competing" updates (which, given
split brain, can compete over arbitrarily long intervals).
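
The claim of 6 distinct intermediate states can be checked mechanically (a small sketch; the column and value names are illustrative):

```python
from itertools import permutations

# Three single-column updates applied in every possible order;
# collect the states seen before the final (identical) state.
updates = [("A", "a"), ("B", "b"), ("C", "c")]
intermediate = set()
for order in permutations(updates):
    state = {"A": "_", "B": "_", "C": "_"}
    for i, (col, val) in enumerate(order):
        state[col] = val
        if i < len(order) - 1:        # skip the final converged state
            intermediate.add(tuple(sorted(state.items())))

assert len(intermediate) == 6   # 3 one-update states + 3 two-update states
```

Every base replica converges to the same final state, but any of the 6 partial states may have been observed and replicated to the MV along the way.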

This is where my concern about a "single (base) node" dependency comes in, but after consideration
it's clear that with a single column this problem is avoided because it's never ambiguous
what the _old_ state was. If you encounter a mutation that is shadowed by your current data,
you can always issue a delete for the correct prior state. With multiple columns that is no
longer possible.
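
A sketch of that single-column argument (hypothetical names, not the patch's code): whichever order mutations arrive in, the replica always knows exactly which prior view row to clean up, because the shadowed value is either its own current value or the incoming one.

```python
def apply_base_mutation(current, incoming):
    """current/incoming: (value, timestamp) or None.
    Returns (new_state, view_ops)."""
    if current is None:
        return incoming, [("insert_view", incoming[0])]
    if incoming[1] > current[1]:
        # Incoming wins: delete the old view row, insert the new one.
        return incoming, [("delete_view", current[0]),
                          ("insert_view", incoming[0])]
    # Incoming is shadowed, but its view row is still unambiguous,
    # so a correct delete can always be issued for it.
    return current, [("delete_view", incoming[0])]

state, ops = apply_base_mutation(("x", 5), ("y", 3))
assert state == ("x", 5) and ops == [("delete_view", "y")]
```

With multiple view columns the shadowed *combination* of values is no longer recoverable from a single replica's state, which is exactly where this scheme breaks down.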

I'm pretty sure the presence of multiple columns introduces other issues with each of the
other moving parts.

h3. Important Implementation Detail
Had a quick browse, and I found that writeOrder is now wrapping multiple synchronous network
operations. OpOrders are only intended to wrap local operations, so this should be rejigged
to avoid locking up the cluster.

h3. TL;DR
Anyway, hopefully that wall of text isn't too unappetizing, and is somewhat helpful. To summarize
my thoughts, I think the following things are worthy of due consideration and potentially
highlighting to the user:

* Availability:
** RF-1 node failures cause parts of the MV to receive NO updates, and remain incapable of
responding correctly to any queries on their portion (this will apply unevenly for a given
base update, meaning multiple generations of value could persist concurrently in the cluster)
** Any node loss results in a significantly larger hit to the consistency of the MV (in my
example, 20% loss of QUORUM to cluster resulted in 57% loss to MV)
** Both of these are potentially avoidable by ensuring we try another node if "ours" is down,
but due consideration needs to be given to whether this results in more cluster inconsistencies
* Repair seems capable of introducing inconsistent cluster states that can only be resolved
by further repair (which itself introduces more such states), resulting in potentially lengthy
inconsistencies, or a repair frequency greater than can operationally be managed right now
* Loss of any two nodes in a vnode cluster can result in permanent inconsistency
* Have we spotted all of the caveats?
* Rejig writeOrder usage
* Multiple columns need a lot more thought

> Materialized Views (was: Global Indexes)
> ----------------------------------------
>                 Key: CASSANDRA-6477
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>            Assignee: Carl Yeksigian
>              Labels: cql
>             Fix For: 3.0 beta 1
>         Attachments:, users.yaml
> Local indexes are suitable for low-cardinality data, where spreading the index across
the cluster is a Good Thing.  However, for high-cardinality data, local indexes require querying
most nodes in the cluster even if only a handful of rows is returned.

This message was sent by Atlassian JIRA
