cassandra-commits mailing list archives

From "Stefan Podkowinski (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
Date Wed, 11 May 2016 13:14:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280088#comment-15280088 ]

Stefan Podkowinski commented on CASSANDRA-11349:
------------------------------------------------

Thanks for the clarification. It's really helpful for understanding how those
parts are supposed to work together.

The serializer approach seems to be a good way to handle this, but there are still [cases|https://github.com/spodkowinski/cassandra-dtest/blob/b110685bceddbcb63ebc744ba54a25cb268f2478/repair_tests/repair_test.py#L438:L451]
[1] that are not handled correctly. I'm going to take a closer look to understand why. I'd also like
to do some more testing for potential digest mismatch storms during rolling upgrades, but
I don't expect any blockers so far.

[1] nosetests repair_tests/repair_test.py:TestRepair.shadowed_range_tombstone_digest_parallel_repair_test


> MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 11349-2.1-v2.patch, 11349-2.1-v3.patch, 11349-2.1.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.
> After investigation, it appears that if two range tombstones exist for a partition for the same range/interval, they are both included in the merkle tree computation.
> But if, for some reason, the two range tombstones were already compacted into a single range tombstone on another node, this results in a merkle tree difference.
> Currently, this is clearly bad because MerkleTree differences depend on compactions (and if a partition is deleted and created multiple times, the only way to ensure that repair "works correctly"/"doesn't overstream data" is to major compact before each repair... which is not really feasible).
> Below is a list of steps that easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected between nodes (while none should have been detected)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> The consequences are a costly repair and an accumulation of many small SSTables (up to thousands for a rather short period of time when using VNodes, until compaction absorbs those small files), as well as an increased size on disk.
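
To illustrate the mismatch described in the issue, here is a toy sketch in Python. It is not Cassandra's validation code: `RangeTombstone`, `digest` and `drop_shadowed` are hypothetical names, MD5 is used only for demonstration, and the model only covers tombstones with identical bounds. It shows that hashing every tombstone as stored gives different digests on a node holding two tombstones for the same interval and on a node where they were already merged, while dropping shadowed tombstones before hashing makes the digests agree.

```python
import hashlib
from typing import Dict, List, NamedTuple, Tuple


class RangeTombstone(NamedTuple):
    """Hypothetical, simplified range tombstone (not Cassandra's class)."""
    start: str       # lower clustering bound
    end: str         # upper clustering bound
    timestamp: int   # deletion timestamp


def digest(tombstones: List[RangeTombstone]) -> str:
    """Hash every tombstone in order, the way a naive validation pass would."""
    h = hashlib.md5()
    for rt in tombstones:
        h.update(f"{rt.start}|{rt.end}|{rt.timestamp};".encode())
    return h.hexdigest()


def drop_shadowed(tombstones: List[RangeTombstone]) -> List[RangeTombstone]:
    """Keep only the newest tombstone for each identical (start, end) interval."""
    newest: Dict[Tuple[str, str], RangeTombstone] = {}
    for rt in tombstones:
        key = (rt.start, rt.end)
        if key not in newest or rt.timestamp > newest[key].timestamp:
            newest[key] = rt
    return sorted(newest.values())


# Node A: both deletes of the same interval are still stored separately.
node_a = [RangeTombstone("b", "b", 100), RangeTombstone("b", "b", 200)]
# Node B: the two tombstones were already merged into the newer one.
node_b = [RangeTombstone("b", "b", 200)]

# Hashing the raw tombstones depends on the merge/compaction state.
print(digest(node_a) == digest(node_b))                                # False
# Hashing after dropping shadowed tombstones does not.
print(digest(drop_shadowed(node_a)) == digest(drop_shadowed(node_b)))  # True
```

The second comparison is only meant to show why normalizing the input before hashing removes the dependency on compaction state. Whether the proposed patches do this in the same way is not stated here.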



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
