cassandra-user mailing list archives

From: Robert Coli <>
Subject: Re: Deduplicating data on a node (RF=1)
Date: Tue, 18 Nov 2014 08:13:01 GMT
On Mon, Nov 17, 2014 at 12:04 PM, Alain Vandendorpe <> wrote:

> With bootstrapping and initial compactions finished that node now has what
> seems to be duplicate data, with almost exactly 2x the expected disk usage.
> CQL returns correct results but we depend on the ability to directly read
> the SSTable files (hence also RF=1.)
> Would anyone have suggestions on a good way to resolve this?

(If I understand correctly, the new node is now joined to the cluster; the
below assumes this.)

** The simplest, slightly inconvenient way, which temporarily reduces
capacity:

1) nodetool cleanup # on the original node

This removes from the original node the obsolete data that has now moved
to the new node. I mention it in case you did not already do it as part of
joining the new node. There are some edge cases [1] where you will "wake
up" old data if you haven't run cleanup before decommissioning the new node.

2) nodetool decommission # on the new node

This streams the data from the new node back onto the original node and
removes the new node from the cluster.
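
A minimal sketch of this step; nodetool netstats is just a convenient way
to watch the streams drain:

    # on the new node
    nodetool decommission

    # in another shell, watch streaming progress until it completes
    nodetool netstats

decommission blocks until streaming finishes, so run it somewhere it can
sit undisturbed (e.g. in screen/tmux).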

3) wipe data from new node # including the system keyspace

4) re-bootstrap new node
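
For steps 3) and 4), something like the following, assuming the stock
package layout under /var/lib/cassandra (adjust paths for your install),
auto_bootstrap left at its default of true, and that the node is not
listed as a seed (seeds do not auto-bootstrap):

    # on the new node, after decommission has completed
    sudo service cassandra stop
    sudo rm -rf /var/lib/cassandra/data/*          # includes system keyspace
    sudo rm -rf /var/lib/cassandra/commitlog/*
    sudo rm -rf /var/lib/cassandra/saved_caches/*
    sudo service cassandra start                   # re-bootstraps on startup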

** The second simplest way, which requires using Size-Tiered Compaction
Strategy (STCS) but does not reduce capacity until step 2):

1) nodetool compact # on the node with the duplicated data

This will merge all your duplicates into One Big SSTable.
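
For reference, a sketch ("my_keyspace" / "my_table" are placeholders; with
no arguments, nodetool compact major-compacts everything):

    # on the node with the duplicated data
    nodetool compact my_keyspace my_table

The catch: under STCS the resulting One Big SSTable is so much larger than
its peers that normal size-tiered compaction will rarely touch it again,
which is why step 2) exists.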

2) if necessary, once RF>1, use sstablesplit [1] (with the node down) to
split up your One Big SSTable.
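
Roughly, assuming default data file locations and the default 50 MB target
size (both placeholders; check sstablesplit --help for your version):

    # with cassandra stopped on this node
    sstablesplit -s 50 /var/lib/cassandra/data/my_keyspace/my_table/*-Data.db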

If you're not using STCS, you can temporarily switch to it, but step 2)
becomes less straightforward.
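
The switch itself is a one-line schema change, e.g. from a shell (names
are placeholders; remember to switch back to your usual strategy
afterwards):

    echo "ALTER TABLE my_keyspace.my_table WITH compaction =
      {'class': 'SizeTieredCompactionStrategy'};" | cqlsh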

