cassandra-commits mailing list archives

From "Fabien Rousseau (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
Date Tue, 05 Apr 2016 21:29:25 GMT


Fabien Rousseau commented on CASSANDRA-11349:

Using the RangeTombstone.Tracker can help in the situation described just above.

In fact, the RT should always update the tracker (see CASSANDRA-11477).
The trick here is to always consider it as "expired" in the tracker (even if it is not) so that
tombstones are not accumulated during compaction (if expired, the tracker keeps only the list
of opened RTs; if not, it keeps all unwritten RTs, i.e. all RTs, because this is a validation).

Looking at the update method of the Tracker, it already checks whether the tombstone is superseded
by another one (and doesn't add it as "opened" if it is).

Thus, the v2 patch:
 - includes the previous patch
 - always updates the tracker with the RT (treating it as expired even if it is not, so as not to
retain too many of them in memory; since this is for validation, it is read-only and won't
affect anything)
 - tests whether the RT was added to the openedTombstones list and, if it was not, skips
it for the digest.
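
To make the approach above concrete, here is a minimal Python sketch of the idea. It is not the patch and not Cassandra's code: the `Tracker` class, the `validation_digest` function, the field names and the timestamps are all hypothetical. It also assumes, as a simplification, that tombstones with the same start are fed to the tracker newest first.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeTombstone:
    start: str
    end: str
    marked_for_delete_at: int  # deletion timestamp (hypothetical field name)


class Tracker:
    """Toy stand-in for a range tombstone tracker: keeps only opened RTs."""

    def __init__(self):
        self.opened = []

    def update(self, rt):
        """Register rt; return True only if it was added to the opened list."""
        # Forget opened tombstones that end before this one starts.
        self.opened = [o for o in self.opened if o.end >= rt.start]
        for o in self.opened:
            covers = o.start <= rt.start and rt.end <= o.end
            if covers and o.marked_for_delete_at >= rt.marked_for_delete_at:
                return False  # superseded: not added as "opened"
        self.opened.append(rt)
        return True


def validation_digest(tombstones):
    """Digest only the tombstones that the tracker accepted as opened."""
    tracker = Tracker()
    digest = hashlib.md5()  # the hash function is an arbitrary choice here
    # Assumption of this sketch: same-start tombstones arrive newest first.
    ordered = sorted(tombstones, key=lambda t: (t.start, -t.marked_for_delete_at))
    for rt in ordered:
        if tracker.update(rt):
            digest.update(repr((rt.start, rt.end, rt.marked_for_delete_at)).encode())
    return digest.hexdigest()


old = RangeTombstone("b", "b", 100)
new = RangeTombstone("b", "b", 200)
same = validation_digest([old, new]) == validation_digest([new])
```

In this sketch `same` is true: the superseded tombstone is never added to the opened list, so it does not contribute to the digest, whether or not it has already been compacted away.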

I know that the patch may be a bit rough (at least the "isLastOpened" method), but the goal is
to validate the approach first, and I did not want the patch to be too invasive (by modifying
the return value of the update method).


Note: I have not yet tested it against our production data.
Note2: Regarding read repair, this seems to be a different story; I can't see anything
for now that could explain those differences (I will dig into this later, as it is less urgent).

> MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
> ---------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-11349
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>         Attachments: 11349-2.1.patch
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions
> were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really
> After investigation, it appears that, if two range tombstones exist for a partition
> for the same range/interval, they're both included in the merkle tree computation.
> But if, for some reason, on another node, the two range tombstones were already compacted
> into a single range tombstone, this will result in a merkle tree difference.
> Currently, this is clearly bad because MerkleTree differences are dependent on compactions
> (and if a partition is deleted and created multiple times, the only way to ensure that repair
> "works correctly"/"doesn't overstream data" is to major compact before each repair... which
> is not really feasible).
> Below is a list of steps that easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor':
> USE test_rt;
> CREATE TABLE table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair, accumulating many small SSTables (up to thousands
> for a rather short period of time when using VNodes, the time for compaction to absorb those
> small files), but also an increased size on disk.
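
The mismatch described in the quoted report can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: `naive_digest`, the tuple layout and the timestamps are hypothetical, and the hash function is arbitrary. It shows a digest that includes every range tombstone found on disk, so the result depends on whether compaction has already merged them.

```python
import hashlib


def naive_digest(range_tombstones):
    """Hash every range tombstone as found, with no superseded check."""
    digest = hashlib.md5()  # arbitrary hash choice for this sketch
    for start, end, timestamp in sorted(range_tombstones):
        digest.update(f"{start}:{end}:{timestamp}".encode())
    return digest.hexdigest()


# Hypothetical state: one node still holds both tombstones for the same
# interval, the other holds only the single merged tombstone.
node_with_two = [("b", "b", 100), ("b", "b", 200)]
node_with_one = [("b", "b", 200)]

mismatch = naive_digest(node_with_two) != naive_digest(node_with_one)
```

Here `mismatch` is true even though both nodes hold logically identical data, which is the situation that makes repair stream partitions that are not really out of sync.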

This message was sent by Atlassian JIRA
