From: Jiri Horky <horky@avast.com>
Date: Sat, 02 Nov 2013 11:36:38 +0100
To: user@cassandra.apache.org
CC: Robert Coli, Zdeněk Ott
Subject: Re: Recompacting all sstables

Hi,

On 11/01/2013 09:15 PM, Robert Coli wrote:
> On Fri, Nov 1, 2013 at 12:47 PM, Jiri Horky <horky@avast.com> wrote:
>
> > since we upgraded half of our Cassandra cluster to 2.0.0 and we use LCS,
> > we hit CASSANDRA-6284 bug.
>
> 1) Why upgrade a cluster to 2.0.0? Hopefully not a production cluster? [1]
I think you already guessed the answer :) It is a production cluster; we needed some features that are only present in 2.0 (in particular, compare-and-set) for our applications. Besides, somebody had to discover the regression, right? :) Thanks for the link.

> 3) What do you mean by "upgraded half of our Cassandra cluster"? That is Not Supported and also Not Advised... for example, before the streaming change in 2.x line, a cluster in such a state may be unable to have nodes added, removed or replaced.
We are in the middle of a migration from 1.2.9 to 2.0, during which we are also upgrading our application, which can only run against 2.0 due to various technical details. It is rather hard to explain, but we hoped it would last just a few days and it is definitely not a state we wanted to stay in. Since we hit the bug, we got stalled in the middle of the migration.

> > So the question. What is the best way to recompact all the sstables so
> > the data in one sstable within a level would contain more or less the
> > right portion of the data?
>
> ...
>
> > Based on documentation, I can only think of switching to SizeTiered
> > compaction, doing major compaction and then switching back to LCS.

> That will work, though be aware of the implication of CASSANDRA-6092 [2]. Briefly, if the CF in question is not receiving write load, you will be unable to promote your One Big SSTable from L0 to L1. In that case, you might want to consider running sstable_split (and then restarting the node) in order to split your One Big SSTable into two or more smaller ones.
Hmm, thinking about it a bit more, I am unsure this will actually help. If I understand things correctly, assuming a uniform distribution of newly received keys in L0 (ensured by RandomPartitioner), in order for LCS to work optimally, I need:

a) a uniform distribution of keys across the sstables within one level, i.e. in every level each sstable covers more or less the same range of keys
b) the sstables in each level should together cover almost the whole key space the node is responsible for (a small sketch of how one might check this follows the list)
c) sstables should be promoted to higher levels in a uniform fashion, e.g. round-robin or randomly (over time, the probability of choosing an sstable as a candidate should be the same for all sstables in the level)
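
To make b) concrete: a tiny, purely illustrative Python sketch (no Cassandra APIs involved; the per-sstable (first_key, last_key) pairs are assumed to be obtainable from the sstable metadata) that estimates how much of the node's key space one level actually covers:

    def level_coverage(ranges, node_range=(0.0, 1.0)):
        """Fraction of the node's key space covered by the union of the
        sstable ranges in one level -- a rough check of condition b)."""
        lo, hi = node_range
        covered, cursor = 0.0, lo
        for first, last in sorted(ranges):
            first, last = max(first, cursor), min(last, hi)
            if last > first:
                covered += last - first
                cursor = last
        return covered / (hi - lo)

    # A level skewed towards the small keys vs. a healthy one:
    print(level_coverage([(0.0, 0.1), (0.1, 0.2), (0.2, 0.3)]))    # ~0.3
    print(level_coverage([(0.0, 0.35), (0.35, 0.7), (0.7, 1.0)]))  # ~1.0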

By splitting the sorted Big SSTable, I will get a bunch of non-overlapping sstables, so I will surely achieve a). Point c) is fixed by the patch. But what about b)? It probably depends on the order of compactions across the levels, i.e. on whether the compactions in the various levels are run in parallel and interleaved or not. If it compacts all the sstables from one level and only after that starts to compact the sstables in the next higher level, and so on, one will end up in a situation very similar to the one caused by the referenced bug (because of the round-robin fashion of choosing candidates), i.e. having the biggest keys in L1 and the smallest keys in the highest level. In that case it would actually not help at all; a toy simulation of this effect is sketched below.
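
To illustrate the concern, a minimal, purely hypothetical toy model in plain Python (nothing Cassandra-specific, and certainly not the real LeveledManifest logic): levels are drained one after another and the candidate is always the sstable with the smallest keys, which is a crude stand-in for a round-robin pointer that keeps starting over from the beginning.

    # sstables are just numbered 0..999 in key order, i.e. the result of
    # splitting the One Big SSTable into many small, non-overlapping pieces.
    def drain_level_by_level(sstables, limits):
        levels = {1: sorted(sstables)}             # everything starts in L1
        for lvl, limit in sorted(limits.items()):
            levels.setdefault(lvl + 1, [])
            while len(levels[lvl]) > limit:
                candidate = levels[lvl].pop(0)     # always the smallest keys
                levels[lvl + 1].append(candidate)  # "compact" = promote upwards
        return levels

    levels = drain_level_by_level(range(1000), {1: 10, 2: 100})
    for lvl, tables in sorted(levels.items()):
        print(lvl, min(tables), max(tables), len(tables))
    # -> L1 keeps keys 990..999, L2 keys 890..989, L3 keys 0..889:
    #    the biggest keys stay in L1, the smallest end up in the highest level.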

Does it make sense or am I completely wrong? :)

BTW: not a very well-thought-out idea, but wouldn't it actually be better to select candidates completely randomly?
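
(In the toy model above, that would just mean replacing the sequential pop with a random pick, e.g.

    import random
    candidate = levels[lvl].pop(random.randrange(len(levels[lvl])))

which, over many promotions, would leave each level holding sstables spread across the whole key space rather than one contiguous chunk.)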

Cheers
Jiri Horky


> =Rob
>
> [1] https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
> [2] https://issues.apache.org/jira/browse/CASSANDRA-6092