Date: Thu, 22 Jan 2015 19:20:36 +0000 (UTC)
From: "Marcus Eriksson (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-7871) Reduce compaction IO in LCS

     [ https://issues.apache.org/jira/browse/CASSANDRA-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-7871:
---------------------------------------
    Fix Version/s:     (was: 2.0.13)
                       3.0

> Reduce compaction IO in LCS
> ----------------------------
>
>                 Key: CASSANDRA-7871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7871
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Dan Hendry
>            Assignee: Dan Hendry
>             Fix For: 3.0
>
>         Attachments: LeveledCompactionImprovement-2.0.10.patch, experiment.png, levelmultiplier.png, sstablesize.png
>
>
> I have found LCS to be superior to STCS in almost every way - except for the fact that it requires significantly more IO (a well advertised property). In leveled compaction, L ~n+1~ is 10 times larger than L ~n~, so generally 1+10 sstables need to be compacted to promote one sstable into the next level. For certain workloads, in practice this means only 1/(10+1) ≈ 9% of the IO, specifically write IO, is doing 'useful' work.
> But why is each level 10 times larger? Why 10? It's a pretty-looking number and all, but that's not a very good reason to choose it. If we chose 5 or even 2 we could reduce the 'wasted' IO required to promote an sstable to the next level - of course at the expense of requiring more levels. I have not been able to find justification for this choice in either Cassandra or LevelDB itself. I would like to introduce a new parameter, the leveling multiplier, which controls the desired size difference between L ~n~ and L ~n+1~.
> First and foremost, a little math. Let's assume we have a CF of a fixed size that is receiving continuous new data (i.e. data is expiring due to TTLs or is being overwritten). I believe the number of levels required is approximately (see note 1):
> {noformat}data size = (sstable size)*(leveling multiplier)^(level count){noformat}
> Which, when solving for the level count, becomes:
> {noformat}level count = log((data size)/(sstable size))/log(leveling multiplier){noformat}
> The amount of compaction write IO required over the lifetime of a particular piece of data (excluding compactions in L0) is:
> {noformat}write IO = (flush IO) + (promotion IO)*(level count)
> write IO = 1 + (1 + (level multiplier))*log((data size)/(sstable size))/log(leveling multiplier){noformat}
> So ultimately, the relationship between write IO and the level multiplier is f\(x) = (1 + x)/log\(x), which is minimized at x ≈ 3.59, or 4 if we round to the nearest integer. Also note that write IO is proportional to log((data size)/(sstable size)), which suggests using larger sstables would also reduce disk IO.
> As one final analytical step we can add the following term to approximate STCS in L0 (which is not actually how it is implemented, but should be close enough for moderate sstable sizes):
> {noformat}L0 write IO = max(0, floor(log((sstable size)/(flush size))/log(4))){noformat}
> The following two graphs illustrate the predicted compaction requirements as a function of the leveling multiplier and sstable size:
> !levelmultiplier.png!!sstablesize.png!
> In terms of empirically verifying the expected results, I set up three cassandra nodes: node A with a leveling multiplier of 10 and sstable size of 160 MB (current cassandra defaults), node B with multiplier 4 and size 160 MB, and node C with multiplier 4 and size 1024 MB. I used a simple write-only workload which inserted data having a TTL of 2 days at 1 MB/second (see note 2). Compaction throttling was disabled and gc_grace was 60 seconds. All nodes had dedicated data disks and IO measurements were for the data disks only.
> !experiment.png!
> ||Measure||Node A (10, 160MB)||Node B (4, 160MB)||Node C (4, 1024MB)||
> |Predicted IO Rate|34.4 MB/s|26.2 MB/s|20.5 MB/s|
> |Predicted Improvement|n/a|23.8%|40.4%|
> |Predicted Number of Levels (Expected Dataset of 169 GB)|3.0|5.0|3.7|
> |Experimental IO Rate|32.0 MB/s|28.0 MB/s|20.4 MB/s|
> |Experimental Improvement|n/a|12.4%|*36.3%*|
> |Experimental Number of Levels|~4.1|~6.1|~4.8|
> |Final Dataset Size (After 88 hours)|301 GB|261 GB|258 GB|
> These results indicate that Node A performed better than expected; I suspect this was because the data insertion rate was a little too high and compaction periodically got backlogged, meaning the promotion from L0 to L1 was more efficient. Also note that the actual dataset size is larger than that used in the analytical model - which is expected, as expired data will not get purged immediately. The size difference between node A and the others, however, seems suspicious to me.
> In summary, these results, both theoretical and experimental, clearly indicate that reducing the level multiplier from 10 to 4 and increasing the sstable size reduces compaction IO. The experimental results, using an SSTable size of 1024 MB and level multiplier of 4, demonstrated a 36% reduction in write IO without a significant increase in the number of levels. I have not run benchmarks for an update-heavy workload but I suspect it would benefit significantly since more data can be 'updated' per compaction. I have also not benchmarked read performance but I would not expect noticeable performance degradation provided an sstable size is chosen which keeps the number of levels roughly equal.
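> To make the model above easy to check, here is a rough Python sketch of it (illustrative only; the helper names and the 160 MB flush size are assumptions for illustration, not measured values from the tests):
> {noformat}
> import math
>
> # level count = log(data size / sstable size) / log(multiplier)
> def level_count(data_size, sstable_size, multiplier):
>     return math.log(data_size / sstable_size) / math.log(multiplier)
>
> # write IO = 1 flush + (1 + multiplier) copies per level + approximate STCS work in L0
> def write_io(data_size, sstable_size, multiplier, flush_size):
>     l0 = max(0, math.floor(math.log(sstable_size / flush_size) / math.log(4)))
>     return 1 + (1 + multiplier) * level_count(data_size, sstable_size, multiplier) + l0
>
> # (1 + x)/log(x) is minimized near x = 3.59
> xs = [x / 100.0 for x in range(200, 1001)]
> print(min(xs, key=lambda x: (1 + x) / math.log(x)))
>
> # Predicted write amplification (= MB/s of compaction IO at 1 MB/s of inserts)
> # for the three test nodes, assuming a 169 GB dataset and 160 MB flushes (sizes in MB):
> MB, GB = 1.0, 1024.0
> for name, mult, size in [("A", 10, 160 * MB), ("B", 4, 160 * MB), ("C", 4, 1024 * MB)]:
>     print(name, round(write_io(169 * GB, size, mult, 160 * MB), 1))
> {noformat}
> Under those assumptions this reproduces the 'Predicted IO Rate' row of the table above (34.4, 26.2 and 20.5 MB/s).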
> The patch I have attached is against 2.0.10 and does not change the defaults. Long term, however, it would make sense to use more optimal defaults unless there is compelling counter-evidence to the performance gains observed.
> One final observation: in current leveled compaction the number of levels is determined by the amount of data and the user-specified sstable size. A compaction strategy where instead the user selected the desired number of levels, and the strategy adjusted the SSTable size based on the amount of data, would have a number of benefits. The strategy would behave more consistently across a much wider range of dataset sizes. Compaction IO overhead (as a function of write rate) and worst-case read performance (number of sstables per read) would both be largely independent of dataset size.
> Note 1: This equation only calculates the amount of data able to fit in the largest level. It would be more accurate to take into account data in smaller levels (i.e. using the geometric series equation) but this is a close enough approximation. There is also the fact that redundant data might be spread across the various levels.
> Note 2: This represents the entropy introduction rate and does not account for any Cassandra overhead, but compression was also enabled. The row key was a long, each row had 512 columns, the column name was a UUID, and the column value was a 64 byte blob.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)