cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8272) 2ndary indexes can return stale data
Date Fri, 12 May 2017 09:57:04 GMT


Sylvain Lebresne commented on CASSANDRA-8272:

I hadn't really looked at the patch before, but I think we can and should make most things
index agnostic here. In particular, I don't think we should use {{Index#postProcessorFor}}
for the coordinator-side filtering. What I would do instead is modify {{PartitionRangeReadCommand#ReconcialiationProcessing}}
to be:
public PartitionIterator postReconciliationProcessing(PartitionIterator result)
    ColumnFamilyStore cfs =;
    Index index = getIndex(cfs);
    if (index == null)
        return result;

    // Indexes on replica can return results that don't match the query but are necessary
    // to avoid stale entries from other nodes (see #8272) so we should filter those out
    // now. Then we apply any index specific post-processor.
    RowFilter indexFilter = index.getIndexQueryFilter(rowFilter());
    return index.postProcessorFor(this)
                .apply(indexFilter.filter(result, metadata(), nowInSec()), this);
where we'd just need to add a {{Index#getIndexQueryFilter}} that would be the exact inverse
of {{Index#getPostIndexQueryFilter}} (and so this can be simply implemented as a default method
that subtract from its input anything from {{Index#getPostIndexQueryFilter}}).

The reason I'm advocating this is two-fold:
# it's simpler in that it's done once generically instead of being re-implemented by each
index implementation; and as index already expose everything we need to do this, no reason
to have them do it "manually".
# probably more importantly, if we do decide to move all filtering coordinator-side in CASSANDRA-8273
(which I do believe is the right thing to do), we'd just have to modify (and actually simplify)
that method slightly to
public PartitionIterator postReconciliationProcessing(PartitionIterator result)
    result = rowFilter().filter(result, metadata(), nowInSec());
    ColumnFamilyStore cfs =;
    Index index = getIndex(cfs);
    if (index == null) ? result : index.postProcessorFor(this).apply(result, this);
which is imo kind of neat conceptually.

It is true that doing so does mean we need indexes to implement {{CustomExpression#isSatisfiedBy}},
which as I mention in my previous comment requires a slight refactor, but I'd argue that this
is something we absolutely should do anyway (and should have done before). It's wrong that
this method is currently basically broken for custom expressions: sure we currently happen
to not use it in that case, but it's not future-proof at all and it's that kind of tribal
knowledge ("you should make sure to not use {{Expression#isSatisfiedBy}} or anything that
uses it if there is {{CustomExpression}} involved") that makes the code hard to work with.

bq. could be provided by a new method {{Index#getPostIndexQueryLimits}}, similar to the existing

That'd work, though we really don't need indexes to implement it, we can use the {{Index#getIndexQueryFilter}}
from above. And so maybe we don't need to add a new method to {{Index}}, and instead just
add a {{DataLimits#withFiltering(RowFilter)}} method (that given a {{DataLimits}}, create
a new one that only count rows that match the provided {{RowFilter}}) and directly use that
with the result of {{Index#getIndexQueryFilter}} when we need it.

bq. Alternatively, we might just disable the limits at {{ReadCommand#executeLocally}} and
{{DataResolver#resolve}} for index queries, and let the indexes take care of restricting the
limits at search time

I'll admit I'm kind of opposed to that. There is always cases where we need to be able to
count stuffs post-query so not being able to do so with index queries would be clunky, if
not properly a blocker. For instance, I suspect it'll break paging without quite a bit of
special casing. Another example would be the row cache, where we do some stuffs around counting
that just wouldn't work in that case (don't get me wrong, we don't use row cache for 2i today,
I'm just trying to illustrate that we'd lose flexibility for future developments).

Further, it took some care in CASSANDRA-8099 to cleanly separate the counting of results from
the actual producer of data but that imo simplified and cleaned thinks up, and I'd hate to
start breaking that kind of "abstraction".

> 2ndary indexes can return stale data
> ------------------------------------
>                 Key: CASSANDRA-8272
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
>            Assignee: Andrés de la Peña
>             Fix For: 3.0.x
> When replica return 2ndary index results, it's possible for a single replica to return
a stale result and that result will be sent back to the user, potentially failing the CL contract.
> For instance, consider 3 replicas A, B and C, and the following situation:
> {noformat}
> CREATE TABLE test (k int PRIMARY KEY, v text);
> CREATE INDEX ON test(v);
> INSERT INTO test(k, v) VALUES (0, 'foo');
> {noformat}
> with every replica up to date. Now, suppose that the following queries are done at {{QUORUM}}:
> {noformat}
> UPDATE test SET v = 'bar' WHERE k = 0;
> SELECT * FROM test WHERE v = 'foo';
> {noformat}
> then, if A and B acknowledge the insert but C respond to the read before having applied
the insert, then the now stale result will be returned (since C will return it and A or B
will return nothing).
> A potential solution would be that when we read a tombstone in the index (and provided
we make the index inherit the gcGrace of it's parent CF), instead of skipping that tombstone,
we'd insert in the result a corresponding range tombstone.  

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message