[ https://issues.apache.org/jira/browse/CASSANDRA7871?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Marcus Eriksson updated CASSANDRA7871:

Fix Version/s: (was: 2.0.13)
3.0
> Reduce compaction IO in LCS
> 
>
> Key: CASSANDRA7871
> URL: https://issues.apache.org/jira/browse/CASSANDRA7871
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Dan Hendry
> Assignee: Dan Hendry
> Fix For: 3.0
>
> Attachments: LeveledCompactionImprovement2.0.10.patch, experiment.png, levelmultiplier.png,
sstablesize.png
>
>
> I have found LCS to be superior to STCS in almost every way  except for the fact that
it requires significantly more IO (a well advertised property). In leveled compaction, L ~n+1~
is 10 times larger than L ~n~ so generally 1+10 sstables need to be compacted to promote one
sstable into the next level. For certain workloads, this practically this means only 1/(10+1)=9%
of the IO, specifically write IO, is doing ‘useful’ work.
> But why is each level 10 times larger? Why 10? Its a pretty looking number and all but
thats not a very good reason to choose it. If we chose 5 or even 2 we could reduce the ‘wasted’
io required to promote an sstable to the next level  of course at the expense of requiring
more levels. I have not been able to find justification for this choice in either cassandra
or leveldb itself. I would like to introduce a new parameter, the leveling multiplier, which
controls the desired size difference between L ~n~ and L ~n+1~.
> First and foremost, a little math. Lets assume we have a CF of a fixed size that is receiving
continuous new data (ie: data is expiring due to TTLs or is being overwritten). I believe
the number of levels required is approximately (see note 1):
> {noformat}data size = (sstable size)*(leveling multiplier)^(level count){noformat}
> Which, when solving for the level count, becomes:
> {noformat}level count = log((data size)/(sstable size))/log(leveling multiplier){noformat}
> The amount of compaction write IO required over the lifetime of a particular piece of
data (excluding compactions in L0) is:
> {noformat}write IO = (flush IO) + (promotion IO)*(level count)
> write IO = 1 + (1 + (level multiplier))*log((data size)/(sstable size))/log(leveling
multiplier){noformat}
> So ultimately, the the relationship between write IO and the level multiplier is f\(x)
= (1 + x)/log\(x) which is optimal at 3.59, or 4 if we round to the nearest integer. Also
note that write IO is proportional to log((data size)/(sstable size)) which suggests using
larger sstables would also reduce disk IO.
> As one final analytical step we can add the following term to approximate STC in L0 (which
is not actually how its implemented but should be close enough for moderate sstable sizes):
> {noformat}L0 write IO = max(0, floor(log((sstable size)/(flush size))/log(4))){noformat}
> The following two graphs illustrate the predicted compaction requirements as a function
of the leveling multiplier and sstable size:
> !levelmultiplier.png!!sstablesize.png!
> In terms of empirically verifying the expected results, I set up three cassandra nodes,
node A having a leveling multiplier of 10 and sstable size if 160 MB (current cassandra defaults),
node B with multiplier 4 and size 160 MB, and node C with multiplier 4 and size 1024 MB. I
used a simple write only workload which inserted data having a TTL of 2 days at 1 MB/second
(see note 2). Compaction throttling was disabled and gc_grace was 60 seconds. All nodes had
dedicated data disks and IO measurements were for the data disks only.
> !experiment.png!
> MeasureNode A (10, 160MB)Node B (4, 160MB)Node C (4, 1024MB)
> Predicted IO Rate34.4 MB/s26.2 MB/s20.5 MB/s
> Predicted Improvementn/a23.8%40.4%
> Predicted Number of Levels (Expected Dataset of 169 GB)3.05.03.7
> Experimental IO Rate32.0 MB/s28.0 MB/s20.4 MB/s
> Experimental Improvementn/a12.4%*36.3%*
> Experimental Number of Levels~4.1~6.1~4.8
> Final Dataset Size (After 88 hours)301 GB261 GB258 GB
> These results indicate that Node A performed better than expected, I suspect that this
was due to the fact that the data insertion rate was a little too high and compaction periodically
got backlogged meaning the promotion from L0 to L1 was more efficient. Also note that the
actual dataset size is larger than that used in the analytical model  which is expected as
expired data will not get purged immediately. The size difference between node A and the others
however seems suspicious to me.
> In summary, these results, both theoretical and experimental, clearly indicate that reducing
the level multiplier from 10 to 4 and increasing the sstable size reduces compaction IO. The
experimental results, using an SSTable size of 1024 MB and level multiplier of 4, demonstrated
a 36% reduction in write IO without a significant increase in the number of levels. I have
not run benchmarks for an update heavy workload but I suspect it would benefit significantly
since more data can be ‘updated’ per compaction. I have also not benchmarked read performance
but I would not expect noticeable performance degradation provided an sstable size is chosen
which keeps the number of levels roughly equal.
> The patch I have attached is against 2.0.10 and does not change the defaults. Long term
however, it would make sense to use more optimal defaults unless there is compelling counter
evidence to the performance gains observed.
> One final observation, in current leveled compaction the number of levels is determined
by the amount of data and the user specified sstable size. A compaction strategy where instead
the user selected the desired number of levels and the strategy adjusted the SSTable size
based on the amount of data would have a number of benefits. The strategy would behave more
consistently across a much wider range of dataset sizes. Compaction IO overhead (as a function
of write rate) and worst case read performance (number of sstables per read) would both be
largely independent of dataset size.
> Note 1: This equation only calculates the amount of data able to fit in the largest level.
It would be more accurate take into account data in smaller levels (ie: using the geometric
series equation) but this is a close enough approximation. There is also the fact that redundant
data might be spread across the various levels.
> Note 2: This represents the entropy introduction rate and does not account for any Cassandra
overhead but compression was also enabled. The row key was a long, each row had 512 columns,
the column name was a UUID, and the column value was a 64 byte blob.

This message was sent by Atlassian JIRA
(v6.3.4#6332)
