cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-8933) Short reads can return deleted results
Date Fri, 20 Mar 2015 18:22:39 GMT


Sylvain Lebresne commented on CASSANDRA-8933:

Turns out this example is not a problem out of well, sheer luck. In {{SliceQueryFilter.collectReducedColumns}},
the stop condition is {{ > count}}, which in practice means that even
when a replica has found has many results as requested, it will still include every tombstones
until it find the new live results. The only reason it was done this way is that for CQL we
have to group cells into rows for counting and the condition used is an easy way to make sure
we don't stop before having included every cell of our last row. The fact it includes tombstones
afterwards is to some extend an inefficiency which we have been too lazy but turns out that
it avoids the problem I've described above.

I'll take a few minutes to think if there is other cases that could still be problematic but
it's actually possible that it's not a problem (it probably was one before 2.0).

> Short reads can return deleted results
> --------------------------------------
>                 Key: CASSANDRA-8933
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Sylvain Lebresne
> The current code for short reads protection does not handle all cases.  Currently, we
retry only if a node had returned the requested number of results, but we have less results
than that post-reconciliation, because this means the node in question may have more results
it hadn't sent due to the limit.
> Consider however 3 nodes A, B, C (RF=3), and following sequence of operations (all done
> # we write 1 and 2 in a partition: all nodes get it.
> # we delete 1: only A and C get it.
> # we delete 2: only B and C get it.
> # we read the first row in the partition (so with a LIMIT 1) and A and B answer first.
> At the last step, A will return the tombstone for 1 and the value 2, while B will return
just 1. So post reconciliation, we'll return 2 (since A returned it and we have no tombstone
for it), while we should return nothing. This is a short read situation: B stopped at 1 because
it was asked only 1 result, but that result didn't made it in the result and we need further
results from it.  However, Because 1 results is requested and we have 1 result post-reconciliation,
the short read retry won't kick in.
> In practice, the short read check should be generalized: if any node X returns the requested
number of results but any of those results gets skipped post-reconciliation, we might have
a short read. Basically, enforcing the limit replica-side is optimistic and assumes that all
results of that replica will be used, and as soon as that assumption fails we should get back
more results.
> Implementing that generalized condition can probably be done in RowDataResolver.scheduleRepairs
by using the repair to know if a node has had some of results skipped by reconciliation but
we want to know if a full CQL row has been skipped or not so this will probably force us to
add some recounting.
> I'll note that I've fixed this problem on my branch for CASSANDRA-8099 (where this is
both simpler and somewhat more efficient since short reads don't retry full queries there),
so if decide this is too risky to fix in 2.1, we can possibly just mark this as duplicate
of CASSANDRA-8099.
> Lastly, it shouldn't be too hard to extends our current short read dtests to test for
that case, but I haven't taken the time to do so yet ([~philipthompson] do you think you can
have a look at adding such test at some point?).

This message was sent by Atlassian JIRA

View raw message