Date: Tue, 10 May 2016 11:43:13 +0000 (UTC)
From: "Stefan Podkowinski (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

    [ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277977#comment-15277977 ]

Stefan Podkowinski commented on CASSANDRA-11349:
------------------------------------------------

To quickly sum up the current behavior: a {{ColumnIndex.Builder}} is created for each {{LazilyCompactedRow.update()}} call. The builder iterates through all atoms produced by the {{MergeIterator}} and uses a {{RangeTombstone.Tracker}} instance for tombstone normalization. Tombstones are added to the tracker from {{Builder.add()}} and by {{LCR.Reducer.getReduced()}}, which in turn is called once for all atoms of the same column as determined by {{onDiskAtomComparator}}.

[~blambov], so what you're saying is that we can't be sure the {{MergeIterator}} will always provide deterministically ordered values, as the write order may differ, and we therefore cannot simply iterate through the reducer to create a correct digest. What concerns me a bit while trying to understand Branimir's approach is that in one scenario {{getReduced()}} will add the RT to the tracker at some point, while in another scenario the RT will be added later, which will cause the serializer to be called differently as well. To put it another way: if we can't be sure the reducer returns deterministically ordered values, won't this affect the tracker and the digest calculation in the builder as well?
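To illustrate the normalization idea being discussed, here is a minimal, hypothetical Java sketch (the class and record names are invented for illustration; this is not Cassandra's actual {{RangeTombstone.Tracker}} or digest code). It keeps only the newest tombstone per identical interval before anything is fed to the digest, so superseded duplicates cannot change the result regardless of the order in which the reducer hands them over.

{noformat}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Cassandra code: normalize range tombstones before
// they reach the digest, so that superseded duplicates for the same
// [start, end] interval cannot influence the result.
public class RangeTombstoneDigestSketch {

    // Simplified stand-in for a range tombstone: interval plus deletion timestamp.
    record RangeTombstone(String start, String end, long markedForDeleteAt) {}

    // Keep only the newest tombstone per identical interval (the "normalized" view).
    static Map<String, RangeTombstone> normalize(Iterable<RangeTombstone> tombstones) {
        Map<String, RangeTombstone> newest = new LinkedHashMap<>();
        for (RangeTombstone rt : tombstones) {
            String key = rt.start() + ":" + rt.end();
            newest.merge(key, rt,
                (a, b) -> a.markedForDeleteAt() >= b.markedForDeleteAt() ? a : b);
        }
        return newest;
    }

    // Digest only the normalized tombstones, mimicking what a validation digest
    // would need to do to stay independent of how many stacked tombstones exist.
    static byte[] digest(Iterable<RangeTombstone> tombstones) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (RangeTombstone rt : normalize(tombstones).values()) {
            md.update((rt.start() + rt.end() + rt.markedForDeleteAt())
                          .getBytes(StandardCharsets.UTF_8));
        }
        return md.digest();
    }
}
{noformat}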
> MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 11349-2.1-v2.patch, 11349-2.1-v3.patch, 11349-2.1.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.
> After investigation, it appears that if two range tombstones exist for a partition for the same range/interval, they're both included in the merkle tree computation.
> But if, for some reason, the two range tombstones were already compacted into a single range tombstone on another node, this will result in a merkle tree difference.
> Currently, this is clearly bad because MerkleTree differences are dependent on compactions (and if a partition is deleted and created multiple times, the only way to ensure that repair "works correctly"/"doesn't overstream data" is to major compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected between the nodes (while it shouldn't have detected any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair and an accumulation of many small SSTables (up to thousands for a rather short period of time when using vnodes, until compaction absorbs those small files), but also an increased size on disk.
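For reference, here is a minimal, self-contained Java demonstration of the mismatch described in the issue (class names and the string encoding of tombstones are invented for illustration; this is not Cassandra code). Hashing the raw tombstone stream gives different digests on a replica that still holds two stacked tombstones for the same interval versus one that has already compacted them into a single tombstone, even though both cover the identical deletion.

{noformat}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the reported mismatch: node A still holds two
// stacked range tombstones for the same interval, node B has already compacted
// them into one. Hashing the raw tombstone stream yields different digests.
public class MerkleMismatchDemo {
    public static void main(String[] args) throws Exception {
        // Tombstones encoded as "start:end:markedForDeleteAt" strings for the demo.
        List<String> nodeA = Arrays.asList("a:b:1000", "a:b:2000"); // not yet compacted
        List<String> nodeB = Arrays.asList("a:b:2000");             // compacted

        // Prints "false": the digests differ although both nodes cover the same deletion.
        System.out.println(Arrays.equals(rawDigest(nodeA), rawDigest(nodeB)));
    }

    static byte[] rawDigest(List<String> tombstones) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (String rt : tombstones)
            md.update(rt.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }
}
{noformat}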