Date: Thu, 22 Jan 2015 19:20:36 +0000 (UTC)
From: "Marcus Eriksson (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-7871) Reduce compaction IO in LCS

     [ https://issues.apache.org/jira/browse/CASSANDRA-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-7871:
---------------------------------------
    Fix Version/s:     (was: 2.0.13)
                       3.0

> Reduce compaction IO in LCS
> ----------------------------
>
>                 Key: CASSANDRA-7871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7871
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Dan Hendry
>            Assignee: Dan Hendry
>             Fix For: 3.0
>
>         Attachments: LeveledCompactionImprovement-2.0.10.patch, experiment.png, levelmultiplier.png, sstablesize.png
>
>
> I have found LCS to be superior to STCS in almost every way - except for the fact that it requires significantly more IO (a well advertised property). In leveled compaction, L ~n+1~ is 10 times larger than L ~n~, so generally 1+10 sstables need to be compacted to promote one sstable into the next level. For certain workloads, in practice this means only 1/(10+1) ≈ 9% of the IO, specifically write IO, is doing 'useful' work.
> But why is each level 10 times larger? Why 10? It's a pretty-looking number and all, but that's not a very good reason to choose it. If we chose 5 or even 2 we could reduce the 'wasted' IO required to promote an sstable to the next level - of course at the expense of requiring more levels. I have not been able to find justification for this choice in either Cassandra or LevelDB itself. I would like to introduce a new parameter, the leveling multiplier, which controls the desired size difference between L ~n~ and L ~n+1~.
> First and foremost, a little math. Let's assume we have a CF of a fixed size that is receiving continuous new data (i.e. data is expiring due to TTLs or is being overwritten). I believe the number of levels required is approximately (see note 1):
> {noformat}data size = (sstable size)*(leveling multiplier)^(level count){noformat}
> Which, when solving for the level count, becomes:
> {noformat}level count = log((data size)/(sstable size))/log(leveling multiplier){noformat}
> The amount of compaction write IO required over the lifetime of a particular piece of data (excluding compactions in L0) is:
> {noformat}write IO = (flush IO) + (promotion IO)*(level count)
> write IO = 1 + (1 + (level multiplier))*log((data size)/(sstable size))/log(leveling multiplier){noformat}
> So ultimately, the relationship between write IO and the level multiplier is f\(x) = (1 + x)/log\(x), which is minimized at x ≈ 3.59, or 4 if we round to the nearest integer. Also note that write IO is proportional to log((data size)/(sstable size)), which suggests using larger sstables would also reduce disk IO.
> As one final analytical step we can add the following term to approximate STCS in L0 (which is not actually how it is implemented, but should be close enough for moderate sstable sizes):
> {noformat}L0 write IO = max(0, floor(log((sstable size)/(flush size))/log(4))){noformat}
> The following two graphs illustrate the predicted compaction requirements as a function of the leveling multiplier and sstable size:
> !levelmultiplier.png!!sstablesize.png!
> In terms of empirically verifying the expected results, I set up three cassandra nodes: node A with a leveling multiplier of 10 and sstable size of 160 MB (current cassandra defaults), node B with multiplier 4 and size 160 MB, and node C with multiplier 4 and size 1024 MB. I used a simple write-only workload which inserted data having a TTL of 2 days at 1 MB/second (see note 2). Compaction throttling was disabled and gc_grace was 60 seconds. All nodes had dedicated data disks and IO measurements were for the data disks only.
> !experiment.png!
> ||Measure||Node A (10, 160MB)||Node B (4, 160MB)||Node C (4, 1024MB)||
> |Predicted IO Rate|34.4 MB/s|26.2 MB/s|20.5 MB/s|
> |Predicted Improvement|n/a|23.8%|40.4%|
> |Predicted Number of Levels (Expected Dataset of 169 GB)|3.0|5.0|3.7|
> |Experimental IO Rate|32.0 MB/s|28.0 MB/s|20.4 MB/s|
> |Experimental Improvement|n/a|12.4%|*36.3%*|
> |Experimental Number of Levels|~4.1|~6.1|~4.8|
> |Final Dataset Size (After 88 hours)|301 GB|261 GB|258 GB|
> These results indicate that Node A performed better than expected; I suspect this was because the data insertion rate was a little too high and compaction periodically got backlogged, meaning the promotion from L0 to L1 was more efficient. Also note that the actual dataset size is larger than that used in the analytical model - which is expected, as expired data will not get purged immediately. The size difference between node A and the others, however, seems suspicious to me.
> In summary, these results, both theoretical and experimental, clearly indicate that reducing the level multiplier from 10 to 4 and increasing the sstable size reduces compaction IO. The experimental results, using an SSTable size of 1024 MB and level multiplier of 4, demonstrated a 36% reduction in write IO without a significant increase in the number of levels. I have not run benchmarks for an update-heavy workload but I suspect it would benefit significantly since more data can be 'updated' per compaction. I have also not benchmarked read performance but I would not expect noticeable performance degradation provided an sstable size is chosen which keeps the number of levels roughly equal.
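> To make the model above easy to check, here is a rough Python sketch of it (illustrative only; the helper names and the 160 MB flush size are assumptions for illustration, not measured values from the tests):
> {noformat}
> import math
>
> # level count = log(data size / sstable size) / log(multiplier)
> def level_count(data_size, sstable_size, multiplier):
>     return math.log(data_size / sstable_size) / math.log(multiplier)
>
> # write IO = 1 flush + (1 + multiplier) copies per level + approximate STCS work in L0
> def write_io(data_size, sstable_size, multiplier, flush_size):
>     l0 = max(0, math.floor(math.log(sstable_size / flush_size) / math.log(4)))
>     return 1 + (1 + multiplier) * level_count(data_size, sstable_size, multiplier) + l0
>
> # (1 + x)/log(x) is minimized near x = 3.59
> xs = [x / 100.0 for x in range(200, 1001)]
> print(min(xs, key=lambda x: (1 + x) / math.log(x)))
>
> # Predicted write amplification (= MB/s of compaction IO at 1 MB/s of inserts)
> # for the three test nodes, assuming a 169 GB dataset and 160 MB flushes (sizes in MB):
> MB, GB = 1.0, 1024.0
> for name, mult, size in [("A", 10, 160 * MB), ("B", 4, 160 * MB), ("C", 4, 1024 * MB)]:
>     print(name, round(write_io(169 * GB, size, mult, 160 * MB), 1))
> {noformat}
> Under those assumptions this reproduces the 'Predicted IO Rate' row of the table above (34.4, 26.2 and 20.5 MB/s).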
> The patch I have attached is against 2.0.10 and does not change the defaults. Long term, however, it would make sense to use more optimal defaults unless there is compelling counter-evidence to the performance gains observed.
> One final observation: in current leveled compaction the number of levels is determined by the amount of data and the user-specified sstable size. A compaction strategy where instead the user selected the desired number of levels, and the strategy adjusted the SSTable size based on the amount of data, would have a number of benefits. The strategy would behave more consistently across a much wider range of dataset sizes. Compaction IO overhead (as a function of write rate) and worst-case read performance (number of sstables per read) would both be largely independent of dataset size.
> Note 1: This equation only calculates the amount of data able to fit in the largest level. It would be more accurate to take into account data in smaller levels (i.e. using the geometric series equation) but this is a close enough approximation. There is also the fact that redundant data might be spread across the various levels.
> Note 2: This represents the entropy introduction rate and does not account for any Cassandra overhead, but compression was also enabled. The row key was a long, each row had 512 columns, the column name was a UUID, and the column value was a 64 byte blob.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)