cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6696) Partition sstables by token range
Date Mon, 16 Mar 2015 16:22:41 GMT


Marcus Eriksson commented on CASSANDRA-6696:

A bit of a status update:

This is now basically 3 parts:
# multi-threaded flushing: one thread per disk, with the owned token range split evenly over
the drives
# one compaction strategy instance per disk
# an optional vnode-aware compaction strategy that you can use if you are using vnodes:
** keeps 2 levels of sstables: level 0 holds newly flushed, bigger sstables, level 1 holds
sstables per vnode
** to avoid getting massive amounts of sstables in L1, we don't compact a vnode into L1 until
we estimate that it can reach a configurable sstable size. During an L0 compaction (which
contains data from all vnodes) we estimate whether "the next" vnode has enough data for an L1
sstable; otherwise we keep the data for that vnode in L0 until the next compaction.
** within each vnode we do size tiering
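
As an illustration only, the first and third parts could be sketched as below. This is not the code from the patch: `disk_boundaries`, `disk_for_token`, `split_l0_output` and their parameters are hypothetical names, and the sketch assumes the owned ranges are given as sorted, non-wrapping (start, end] token pairs and that a per-vnode size estimate is already available.

```python
from bisect import bisect_left


def disk_boundaries(owned_ranges, disk_count):
    """Return one upper-bound token per disk so that each disk holds a
    roughly equal share of the owned tokens.

    owned_ranges: sorted, non-overlapping, non-wrapping (start, end] pairs
    (an assumption of this sketch).
    """
    total = sum(end - start for start, end in owned_ranges)
    # Cumulative token counts at which disk 1, 2, ... should end.
    targets = iter(total * i // disk_count for i in range(1, disk_count))
    target = next(targets, None)
    boundaries = []
    covered = 0  # owned tokens in the ranges walked so far
    for start, end in owned_ranges:
        size = end - start
        while target is not None and covered + size >= target:
            boundaries.append(start + (target - covered))
            target = next(targets, None)
        covered += size
    boundaries.append(owned_ranges[-1][1])  # last disk takes the remainder
    return boundaries


def disk_for_token(token, boundaries):
    """Index of the disk (and so of the flush thread) that owns this token."""
    return min(bisect_left(boundaries, token), len(boundaries) - 1)


def split_l0_output(estimated_bytes_per_vnode, target_sstable_bytes):
    """Decide, per vnode, whether an L0 compaction writes an L1 sstable.

    A vnode moves to L1 only if its estimated data reaches the configured
    sstable size; otherwise its data stays in L0 until the next compaction.
    """
    to_l1, stay_in_l0 = [], []
    for vnode, estimated in estimated_bytes_per_vnode.items():
        (to_l1 if estimated >= target_sstable_bytes else stay_in_l0).append(vnode)
    return to_l1, stay_in_l0
```

For example, with owned ranges (0, 100] and (200, 300] and 4 disks, `disk_boundaries` puts the disk boundaries at tokens 50, 100, 250 and 300, so every disk covers 50 owned tokens.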

* rebalancing after ring changes and when disks break
* we can flush before knowing what ranges we own (e.g. during commit log replay)
** we might need to persist which tokens this node has (this includes local tokens and others
we have due to replication)
* improving compaction strategy heuristics

> Partition sstables by token range
> ---------------------------------
>                 Key: CASSANDRA-6696
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: sankalp kohli
>            Assignee: Marcus Eriksson
>              Labels: compaction, correctness, dense-storage, performance
>             Fix For: 3.0
> In JBOD, when someone gets a bad drive, the bad drive is replaced with a new empty one
and repair is run. 
> This can cause deleted data to come back in some cases. The same is true for corrupt
sstables, where we delete the corrupt sstable and run repair. 
> Here is an example:
> Say we have 3 nodes A, B and C, with RF=3 and GC grace=10 days. 
> row=sankalp col=sankalp was written 20 days back and successfully went to all three nodes.

> Then a delete/tombstone was written successfully for the same row column 15 days back.

> Since this tombstone is older than gc grace, it was purged on nodes A and B when it
got compacted with the actual data. So there is no trace of this row column on nodes A and B.
> Now in node C, say the original data is in drive1 and tombstone is in drive2. Compaction
has not yet reclaimed the data and tombstone.  
> Drive2 becomes corrupt and is replaced with a new empty drive. 
> Due to the replacement, the tombstone is now gone and row=sankalp col=sankalp has come
back to life. 
> Now after replacing the drive we run repair. This data will be propagated to all nodes.

> Note: This is still a problem even if we run repair every gc grace. 
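
The scenario above can be replayed with a toy model. This is a sketch, not Cassandra code: each drive is a dict, timestamps are days relative to now, `None` stands in for a tombstone, and `read`, `repair` and the stored value "value" are hypothetical names chosen for the illustration.

```python
def read(node, key):
    """Merge every version of key across a node's drives; newest timestamp wins.

    Returns the value, or None if the key is deleted or absent.
    """
    versions = [drive[key] for drive in node if key in drive]
    if not versions:
        return None
    _, value = max(versions, key=lambda version: version[0])
    return value


def repair(nodes, key):
    """Very simplified repair: copy the newest known version to every node."""
    versions = [drive[key] for node in nodes for drive in node if key in drive]
    if versions:
        newest = max(versions, key=lambda version: version[0])
        for node in nodes:
            node[0][key] = newest


KEY = ("sankalp", "sankalp")

# Nodes A and B already compacted the data together with the expired
# tombstone, so nothing is left there.
node_a = [{}]
node_b = [{}]
# Node C: data on drive1 (20 days back), tombstone on drive2 (15 days back).
node_c = [{KEY: (-20, "value")}, {KEY: (-15, None)}]

before_failure = read(node_c, KEY)     # tombstone shadows the data

node_c[1] = {}                         # drive2 replaced with an empty drive
after_replacement = read(node_c, KEY)  # the old data is visible again

repair([node_a, node_b, node_c], KEY)  # repair spreads it to A and B
after_repair = [read(node, KEY) for node in (node_a, node_b, node_c)]
```

In this model the read on node C returns nothing before the drive failure, returns the old value once drive2 is empty, and after repair all three nodes return it.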
