Date: Thu, 26 Jan 2017 12:07:25 +0000 (UTC)
From: "Stefan Podkowinski (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-13153) Reappeared Data when Mixing Incremental and Full Repairs

    [ https://issues.apache.org/jira/browse/CASSANDRA-13153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15839622#comment-15839622 ]

Stefan Podkowinski commented on CASSANDRA-13153:
------------------------------------------------

Thanks for reporting this, [~Amanda.Debrot]! Let me try to wrap up again what's happening here.

I think the assumption was that anti-compaction will isolate repaired ranges into the repaired set of sstables, while parts of sstables not covered by the repair will stay in the unrepaired set.
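To make that assumption concrete, here is a minimal, self-contained sketch of the expected split (this is not Cassandra's actual code; the SSTable model, the token-range tuples and the _overlaps() helper are simplifications made up for illustration):

{code:python}
from dataclasses import dataclass
from typing import List, Tuple

# Toy model: an sstable is just a set of token ranges plus a repairedAt marker.
@dataclass
class SSTable:
    ranges: List[Tuple[int, int]]   # token ranges covered by this sstable
    repaired_at: int = 0            # 0 == unrepaired, anything else == repaired

def _overlaps(r, repaired_ranges):
    lo, hi = r
    return any(not (hi <= r_lo or lo >= r_hi) for r_lo, r_hi in repaired_ranges)

def anticompact(sstables, repaired_ranges, repaired_at):
    """Split each sstable into the part covered by the just-repaired ranges
    (goes to the repaired set) and the rest (goes to the unrepaired set)."""
    repaired_set, unrepaired_set = [], []
    for s in sstables:
        covered = [r for r in s.ranges if _overlaps(r, repaired_ranges)]
        uncovered = [r for r in s.ranges if not _overlaps(r, repaired_ranges)]
        if covered:
            repaired_set.append(SSTable(covered, repaired_at))
        if uncovered:
            # This branch is taken regardless of whether the input sstable
            # was already marked repaired.
            unrepaired_set.append(SSTable(uncovered, repaired_at=0))
    return repaired_set, unrepaired_set
{code}

The trouble described below comes from that second branch: it also fires for sstables that are already marked repaired, so data or tombstones outside the just-repaired ranges get demoted back into the unrepaired set.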
As described by Amanda, trouble starts when anti-compaction takes place exclusively on already repaired sstables. Once we've finished repairing a certain range using full repair, anti-compaction will move unaffected ranges in overlapping sstables from the repaired back into the unrepaired set, even if those ranges have actually been repaired before. As the overlap between ranges and sstables is non-deterministic, we could see regular cells, tombstones or both being moved to unrepaired, depending on whether the sstable happens to overlap or not.

Unfortunately this is not the only way this can happen. As described in CASSANDRA-9143, compactions during the repair can prevent anti-compaction of individual sstables, so tombstones and data can end up in different sets in that case as well.

bq. I've only tested it on Cassandra version 2.2 but it most likely also affects all Cassandra versions with incremental repair - like 2.1 and 3.0.

I think 2.1 should not be affected, as we started doing anti-compactions for full repairs in 2.2.


> Reappeared Data when Mixing Incremental and Full Repairs
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-13153
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13153
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction, Tools
>         Environment: Apache Cassandra 2.2
>            Reporter: Amanda Debrot
>              Labels: Cassandra
>         Attachments: log-Reappeared-Data.txt, Step-by-Step-Simulate-Reappeared-Data.txt
>
>
> This happens for both LeveledCompactionStrategy and SizeTieredCompactionStrategy. I've only tested it on Cassandra version 2.2, but it most likely also affects all Cassandra versions with incremental repair, like 2.1 and 3.0.
> When mixing incremental and full repairs, there are a few scenarios where the Data SSTable is marked as unrepaired and the Tombstone SSTable is marked as repaired. Then, if it is past gc_grace and the tombstone and data have been compacted out on other replicas, the next incremental repair will push the Data to the other replicas without the tombstone.
> Simplified scenario:
> 3 node cluster with RF=3
> Initial config:
>     Node 1 has data and tombstone in separate SSTables.
>     Node 2 has data and no tombstone.
>     Node 3 has data and tombstone in separate SSTables.
> Incremental repair (nodetool repair -pr) is run every day, so now we have the tombstone on each node.
> Some minor compactions have happened since, so data and tombstone get merged into 1 SSTable on Nodes 1 and 3.
>     Node 1 had a minor compaction that merged data with tombstone. 1 SSTable with tombstone.
>     Node 2 has data and tombstone in separate SSTables.
>     Node 3 had a minor compaction that merged data with tombstone. 1 SSTable with tombstone.
> Incremental repairs keep running every day.
> Full repairs run weekly (nodetool repair -full -pr).
> Now there are 2 scenarios where the Data SSTable will get marked as "Unrepaired" while the Tombstone SSTable will get marked as "Repaired".
> Scenario 1:
>         Since the Data and Tombstone SSTables have been marked as "Repaired" and anticompacted, they have had minor compactions with other SSTables containing keys from other ranges. During full repair, if the last node to run it doesn't own this particular key in its partitioner range, the Data and Tombstone SSTables will get anticompacted and marked as "Unrepaired".
> Now in the next incremental repair, if the Data SSTable is involved in a minor compaction during the repair but the Tombstone SSTable is not, the resulting compacted SSTable will be marked "Unrepaired" while the Tombstone SSTable stays marked "Repaired".
> Scenario 2:
>         Only the Data SSTable had a minor compaction with other SSTables containing keys from other ranges after being marked as "Repaired". The Tombstone SSTable was never involved in a minor compaction, therefore all keys in that SSTable belong to 1 particular partitioner range. During full repair, if the last node to run it doesn't own this particular key in its partitioner range, the Data SSTable will get anticompacted and marked as "Unrepaired". The Tombstone SSTable stays marked as "Repaired".
> Then it's past gc_grace. Since Nodes 1 and 3 only have 1 SSTable for that key, the tombstone will get compacted out.
>     Node 1 has nothing.
>     Node 2 has data (in unrepaired SSTable) and tombstone (in repaired SSTable) in separate SSTables.
>     Node 3 has nothing.
> Now when the next incremental repair runs, it will only use the Data SSTable to build the merkle tree, since the tombstone SSTable is flagged as repaired and the data SSTable is marked as unrepaired. And the data will get repaired against the other two nodes.
>     Node 1 has data.
>     Node 2 has data and tombstone in separate SSTables.
>     Node 3 has data.
> If a read request hits Nodes 1 and 3, it will return data. If it hits 1 and 2, or 2 and 3, however, it will return no data.
> Tested this with single range tokens for simplicity.
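For anyone reproducing the steps above, a quick way to see which sstables have ended up in the repaired vs. unrepaired set is to ask sstablemetadata for its "Repaired at" value (a rough sketch, assuming the sstablemetadata tool shipped with Cassandra 2.2 is on the PATH and that the data directory path, which is only a placeholder here, points at the table under test):

{code:python}
import glob
import subprocess

def repaired_status(data_dir):
    """Print repaired/unrepaired state for every sstable in data_dir."""
    for path in sorted(glob.glob(data_dir + "/*-Data.db")):
        out = subprocess.run(["sstablemetadata", path],
                             capture_output=True, text=True).stdout
        repaired_at = "?"
        for line in out.splitlines():
            if line.startswith("Repaired at"):
                repaired_at = line.split(":", 1)[1].strip()
                break
        status = "unrepaired" if repaired_at == "0" else "repaired"
        print("%-80s %s (Repaired at: %s)" % (path, status, repaired_at))

# Placeholder path -- adjust to the keyspace/table directory being tested.
repaired_status("/var/lib/cassandra/data/my_keyspace/my_table")
{code}

In the scenarios described above, the Tombstone SSTable would report a non-zero "Repaired at" value while the Data SSTable reports 0.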