Date: Mon, 24 Mar 2014 14:45:46 +0000 (UTC)
From: "Marcus Eriksson (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-6696) Drive replacement in JBOD can cause data to reappear.

    [ https://issues.apache.org/jira/browse/CASSANDRA-6696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13945173#comment-13945173 ]

Marcus Eriksson commented on CASSANDRA-6696:
--------------------------------------------

Been poking at this; WIP patch pushed here: https://github.com/krummas/cassandra/commits/marcuse/6696

It does the following:
* Extracts an interface out of SSTableWriter (imaginatively called SSTableWriterInterface) and starts using this interface everywhere
* Creates DiskAwareSSTableWriter, which knows about the disk layout, and starts using it instead of the standard SSTW
* Assigns ranges of tokens to the disks; this way we only need to check "is the key we are appending larger than the boundary token for the current disk? If so, create a new SSTableWriter for that disk" (a minimal sketch of this check is appended at the end of this message)
* Breaks unit tests

todo:
* Fix unit tests, general cleanups
* I kind of want to name the interface SSTableWriter and call the old SSTW class something else, but I guess SSTW is the class that most external people depend on, so maybe not
* Take disk size into consideration when splitting the ranges over disks; this needs to be deterministic though, so we have to use total disk size instead of free disk space
* Make partitioners other than M3P work
* Fix the key cache

Rebalancing of data is simply a matter of running upgradesstables or scrub; if we lose a disk, we will take writes on the other disks.

Comments on this approach?

> Drive replacement in JBOD can cause data to reappear.
> ------------------------------------------------------
>
>                 Key: CASSANDRA-6696
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6696
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: sankalp kohli
>            Assignee: Marcus Eriksson
>             Fix For: 3.0
>
>
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new empty one and repair is run.
> This can cause deleted data to come back in some cases. The same is true for corrupt sstables, where we delete the corrupt sstable and run repair.
> Here is an example:
> Say we have 3 nodes A, B and C, with RF=3 and GC grace = 10 days.
> row=sankalp col=sankalp was written 20 days back and successfully went to all three nodes.
> Then a delete/tombstone was successfully written for the same row column 15 days back.
> Since this tombstone is older than GC grace, it got compacted away together with the actual data on nodes A and B, so there is no trace of this row column on nodes A and B.
> Now on node C, say the original data is on drive1 and the tombstone is on drive2. Compaction has not yet reclaimed the data and tombstone.
> Drive2 becomes corrupt and is replaced with a new, empty drive.
> Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp has come back to life.
> Now, after replacing the drive, we run repair. This data will be propagated to all nodes.
> Note: This is still a problem even if we run repair every GC grace period.


--
This message was sent by Atlassian JIRA
(v6.2#6252)
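
A minimal sketch of the per-disk boundary-token check described in the comment above, assuming a simplified model where tokens are plain longs and each data directory owns a contiguous, sorted token range. The class and member names here (DiskAwareWriterSketch, DiskBoundary, SimpleWriter) are illustrative stand-ins, not the actual classes from the linked patch or from Cassandra:

import java.util.List;

/**
 * Sketch (not the actual patch) of the disk-aware writing idea: each data
 * directory owns a contiguous token range, and the writer switches to the next
 * directory's writer once the appended key's token crosses the current
 * directory's upper boundary.
 */
public class DiskAwareWriterSketch {

    /** A data directory plus the inclusive upper token boundary it owns. */
    static final class DiskBoundary {
        final String dataDirectory;
        final long upperToken;
        DiskBoundary(String dataDirectory, long upperToken) {
            this.dataDirectory = dataDirectory;
            this.upperToken = upperToken;
        }
    }

    /** Stand-in for SSTableWriter: just records which directory it writes to. */
    static final class SimpleWriter {
        final String dataDirectory;
        SimpleWriter(String dataDirectory) { this.dataDirectory = dataDirectory; }
        void append(long token) {
            System.out.printf("append token %d -> %s%n", token, dataDirectory);
        }
    }

    private final List<DiskBoundary> boundaries; // sorted by upperToken
    private int current = 0;
    private SimpleWriter writer;

    DiskAwareWriterSketch(List<DiskBoundary> boundaries) {
        this.boundaries = boundaries;
        this.writer = new SimpleWriter(boundaries.get(0).dataDirectory);
    }

    /** Keys are assumed to arrive in token order, as they do when an sstable is written. */
    void append(long token) {
        // "Is the key we are appending larger than the boundary token for the
        //  current disk? If so, create a new SSTableWriter for that disk."
        while (current < boundaries.size() - 1 && token > boundaries.get(current).upperToken) {
            current++;
            writer = new SimpleWriter(boundaries.get(current).dataDirectory);
        }
        writer.append(token);
    }

    public static void main(String[] args) {
        DiskAwareWriterSketch w = new DiskAwareWriterSketch(List.of(
                new DiskBoundary("/data1", 0L),                // tokens <= 0 stay on disk 1
                new DiskBoundary("/data2", Long.MAX_VALUE)));  // the rest go to disk 2
        w.append(-42);  // -> /data1
        w.append(17);   // -> /data2 (boundary crossed, new writer created)
    }
}

The point of the sketch is only that, because keys arrive in token order, switching writers when the boundary token is crossed keeps each disk's sstables confined to its own token range; the real change would do this inside the append path of the disk-aware writer described above.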