cassandra-commits mailing list archives

From "Philip Thompson (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-8737) AdjacentDataCompactionStrategy
Date Wed, 18 Mar 2015 21:57:38 GMT


Philip Thompson updated CASSANDRA-8737:
    Issue Type: New Feature  (was: Bug)

> AdjacentDataCompactionStrategy
> ------------------------------
>                 Key: CASSANDRA-8737
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Benedict
>             Fix For: 3.0
> In the original ticket for dealing with timeseries data that introduced DTCS, the first
suggestion was for an approach that compacted adjacent data (by clustering columns) together
until a single page (or some fixed multiple of pages) on average contained only one partition's
worth of data. The idea would be to compact any sstables that overlap in their clustering components,
so that only one sstable (or a fixed number of them) needs to be queried for any clustering range.
The upshot of this would be a tunable compaction burden to get optimal read behaviour, more
explicitly defined than the decay in DTCS.
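
As a rough illustration of "compact any sstables that overlap in their clustering components", the sketch below groups sstables whose clustering ranges overlap, directly or transitively. It is not Cassandra code: the `SSTable` tuple, its `min_clustering` / `max_clustering` fields and the use of plain integers as clustering values are all assumptions made for the example.

```python
from typing import List, NamedTuple


class SSTable(NamedTuple):
    # Hypothetical stand-in for an sstable's metadata; integers stand in
    # for clustering prefixes.
    name: str
    min_clustering: int
    max_clustering: int


def overlapping_groups(sstables: List[SSTable]) -> List[List[SSTable]]:
    """Group sstables whose [min, max] clustering ranges overlap transitively."""
    groups: List[List[SSTable]] = []
    current: List[SSTable] = []
    current_max = None
    # Sweep in order of range start; a table joins the current group if it
    # starts at or before the furthest end seen so far in that group.
    for table in sorted(sstables, key=lambda t: t.min_clustering):
        if current and table.min_clustering <= current_max:
            current.append(table)
            current_max = max(current_max, table.max_clustering)
        else:
            if current:
                groups.append(current)
            current = [table]
            current_max = table.max_clustering
    if current:
        groups.append(current)
    return groups


tables = [
    SSTable("a", 0, 10),
    SSTable("b", 5, 15),
    SSTable("c", 20, 30),
    SSTable("d", 14, 18),
]
groups = overlapping_groups(tables)
group_names = [[t.name for t in g] for g in groups]
```

Each group with more than one member would be a compaction candidate under this reading; a group of one already satisfies the "one sstable per clustering range" goal.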
> The basic idea would be to select boundary clustering prefixes based on the current data
occupancy within those ranges, falling roughly along the boundaries of the existing sstables,
but so that any overlapping tail falls on one side or the other. We then compact all overlapping
sstables, and split the results onto one side or the other of the boundary (or across multiple
boundaries). If there are no historical updates, this gives near-optimal behaviour; we only
compact files until we get to our packing threshold (so that reads are known to be at the
configured efficiency), and then stop. If updates to older records appear, they would be compacted
into their boundary buckets, and left there until we had enough files in a boundary (probably
following normal STCS rules) that it warranted compaction.
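
The compact-then-split step might look something like the sketch below, which merges already-sorted inputs and routes each row to the bucket its clustering key falls in. This is only one reading of the proposal: the `Row` shape, the `compact_and_split` name and the integer boundaries are hypothetical, the boundaries are taken as given rather than derived from data occupancy, and no reconciliation of duplicate keys is attempted.

```python
from bisect import bisect_right
from heapq import merge
from typing import Dict, List, Sequence, Tuple

# Hypothetical simplified row: (clustering key, value).
Row = Tuple[int, str]


def compact_and_split(
    sstables: Sequence[List[Row]], boundaries: List[int]
) -> Dict[int, List[Row]]:
    """Merge sorted sstables and split the output at the given boundaries.

    `boundaries` must be sorted. Bucket i receives keys k with
    boundaries[i-1] <= k < boundaries[i]; bucket 0 is everything below the
    first boundary and the last bucket everything at or above the final one.
    """
    buckets: Dict[int, List[Row]] = {}
    # heapq.merge streams the inputs in sorted order without loading
    # everything at once, as a compaction would.
    for key, value in merge(*sstables):
        bucket = bisect_right(boundaries, key)
        buckets.setdefault(bucket, []).append((key, value))
    return buckets


inputs = [
    [(1, "a"), (12, "b")],
    [(5, "c"), (10, "d"), (25, "e")],
]
result = compact_and_split(inputs, boundaries=[10, 20])
```

Because every output bucket lies entirely on one side of each boundary, a read for a clustering range only needs the buckets that range touches.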
> The benefit, by comparison to DTCS, is that such historical updates are still accounted for
and bounded, and the configuration parameters give more tunable characteristics, with explicit
expectations (i.e. one seek per X bytes read in a partition; a higher X may imply more compaction,
a lower X more merges and seeks on read). It may also permit some easy optimisations further
up the stack, since we can guarantee the boundaries of overlap.

This message was sent by Atlassian JIRA
