On 11/01/2013 09:15 PM, Robert Coli wrote:
On Fri, Nov 1, 2013 at 12:47 PM, Jiri Horky <horky@avast.com> wrote:
since we upgraded half of our Cassandra cluster to 2.0.0 and we use LCS,
we hit the CASSANDRA-6284 bug.

1) Why upgrade a cluster to 2.0.0? Hopefully not a production cluster? [1]
I think you already guessed the answer :) It is a production cluster; our applications needed some features (particularly compare-and-set) that are only present in 2.0. Besides, somebody had to discover the regression, right? :) Thanks for the link.

3) What do you mean by "upgraded half of our Cassandra cluster"? That is Not Supported and also Not Advised... for example, before the streaming change in the 2.x line, a cluster in such a state may be unable to have nodes added, removed, or replaced.
We are in the middle of a migration from 1.2.9 to 2.0, during which we are also upgrading our application, which can only run against 2.0 due to various technical details. It is rather hard to explain, but we hoped it would last just a few days, and it is definitely not a state we want to stay in. Since we hit the bug, we are stalled in the middle of the migration.

So the question: what is the best way to recompact all the sstables so
that the sstables within a level each cover more or less the right
portion of the key space?
Based on the documentation, I can only think of switching to SizeTiered
compaction, doing a major compaction, and then switching back to LCS.

That will work, though be aware of the implication of CASSANDRA-6092 [2]. Briefly, if the CF in question is not receiving write load, you will be unable to promote your One Big SSTable from L0 to L1. In that case, you might want to consider running sstablesplit (and then restarting the node) to split your One Big SSTable into two or more smaller ones.
Hmm, thinking about it a bit more, I am not sure this will actually help. If I understand things correctly, assuming a uniform distribution of newly received keys in L0 (ensured by RandomPartitioner), for LCS to work optimally I need to:

a) get a uniform distribution of keys across the sstables in each level, i.e. every sstable in a level covers a more or less equally sized range of keys
b) have the sstables in each level together cover almost the whole key space the node is responsible for
c) promote sstables to higher levels in a uniform fashion, e.g. round-robin or random (over time, the probability of choosing an sstable as a candidate should be the same for all sstables in the level)
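To make a) and b) concrete, here is a toy model (purely illustrative, my own construction, not Cassandra's data structures): an "sstable" is just a (min_key, max_key) interval of a [0, 100) key space.

```python
# Toy model of properties a) and b) above; an "sstable" is just a
# (min_key, max_key) interval of the key space. Illustration only --
# this is not Cassandra's implementation.

def split_big_sstable(space, n):
    """Split one big sorted sstable covering `space` into n equal,
    non-overlapping pieces -- property a)."""
    lo, hi = space
    step = (hi - lo) / n
    return [(lo + i * step, lo + (i + 1) * step) for i in range(n)]

def coverage(sstables, space):
    """Fraction of the key space a level's sstables cover -- property b)."""
    total, prev_end = 0.0, space[0]
    for s_lo, s_hi in sorted(sstables):
        s_lo = max(s_lo, prev_end)
        if s_hi > s_lo:
            total += s_hi - s_lo
            prev_end = s_hi
    return total / (space[1] - space[0])

space = (0.0, 100.0)
level = split_big_sstable(space, 10)
print(coverage(level, space))  # 1.0 -- the split pieces cover the whole space
```

In this model, splitting the One Big SSTable immediately gives a) and full coverage for the level it lands in; the question below is whether coverage survives promotion.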

By splitting the sorted Big SSTable, I will get a bunch of non-overlapping sstables, so I will surely achieve a). Point c) is fixed by the patch. But what about b)? It probably depends on the order of compactions across levels, i.e. whether compactions in the various levels run in parallel and interleaved or not. If all the sstables in one level are compacted before compaction of the next level starts, and so on, one will end up in a situation very similar to the one caused by the referenced bug (because of the round-robin choice of candidates), i.e. with the biggest keys in L1 and the smallest keys in the highest level. In that case, it would actually not help at all.
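For reference, the round-robin candidate selection I mean can be sketched as follows (a toy model of my understanding; the real logic lives in Cassandra's LeveledManifest and tracks the last compacted key per level):

```python
import bisect

def next_candidate(first_keys, last_compacted_key):
    """Round-robin selection: pick the sstable whose first key follows the
    last compacted key, wrapping around to the start of the key space.
    `first_keys` must be sorted."""
    i = bisect.bisect_right(first_keys, last_compacted_key)
    return first_keys[i % len(first_keys)]

# Walking a level of sstables whose first keys are 10, 20, 30:
print(next_candidate([10, 20, 30], 15))  # 20
print(next_candidate([10, 20, 30], 30))  # 10 -- wrapped around
```

The picker always walks a level in key order, which is why the order in which levels are drained determines which key ranges end up where.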

Does this make sense, or am I completely wrong? :)

BTW: not a very well-thought-out idea, but wouldn't it actually be better to select candidates completely randomly?
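A minimal sketch of that idea (hypothetical, not what Cassandra does): pick the next candidate uniformly at random, so that over time every sstable in a level is equally likely to be promoted, which gives property c) directly.

```python
import random
from collections import Counter

def random_candidate(sstables, rng):
    """Pick the next compaction candidate uniformly at random instead of
    round-robin -- over time every sstable is equally likely to be promoted."""
    return rng.choice(sstables)

rng = random.Random(42)  # fixed seed just to make the demo reproducible
picks = Counter(random_candidate(list(range(8)), rng) for _ in range(80_000))
# Each of the 8 sstables is chosen roughly 10,000 times.
```

One possible downside: unlike round-robin, pure random selection gives no upper bound on how long an unlucky sstable can go without being picked.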

Jiri Horky