cassandra-commits mailing list archives

From "Benedict (JIRA)" <>
Subject [jira] [Created] (CASSANDRA-8737) AdjacentDataCompactionStrategy
Date Wed, 04 Feb 2015 17:23:34 GMT
Benedict created CASSANDRA-8737:

             Summary: AdjacentDataCompactionStrategy
                 Key: CASSANDRA-8737
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Benedict
            Assignee: Benedict
             Fix For: 3.0

The original ticket for dealing with timeseries data, which introduced DTCS, first suggested
an approach that compacts adjacent data (by clustering columns) together until a single
page (or some fixed multiple of pages) contains, on average, only one partition's worth
of data. The idea would be to compact any sstables whose clustering ranges overlap,
so that only one sstable (or a fixed number of them) needs to be queried for any clustering
range. The upshot would be a tunable compaction burden that yields optimal read behaviour,
more explicitly defined than the decay in DTCS.
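
As a rough illustration of "compact any sstables whose clustering ranges overlap", the
sketch below groups sstables into sets that would be compacted together. It is not taken
from Cassandra: the SSTable tuple, its min_ck/max_ck fields, the use of integers as
stand-ins for clustering prefixes, and overlap_groups are all hypothetical names.

```python
from typing import List, NamedTuple


class SSTable(NamedTuple):
    """Hypothetical stand-in for an sstable's clustering coverage."""
    name: str
    min_ck: int  # smallest clustering value covered (integer stand-in)
    max_ck: int  # largest clustering value covered, inclusive


def overlap_groups(sstables: List[SSTable]) -> List[List[SSTable]]:
    """Group sstables whose clustering ranges overlap, directly or transitively."""
    groups: List[List[SSTable]] = []
    group_max = 0  # upper bound of the group currently being built
    for table in sorted(sstables, key=lambda t: t.min_ck):
        if groups and table.min_ck <= group_max:
            # Starts inside the current group's range: same compaction set.
            groups[-1].append(table)
            group_max = max(group_max, table.max_ck)
        else:
            # No overlap with anything seen so far: start a new group.
            groups.append([table])
            group_max = table.max_ck
    return groups


tables = [
    SSTable("a", 0, 10),
    SSTable("b", 5, 20),
    SSTable("c", 30, 40),
    SSTable("d", 18, 25),
]
group_names = [[t.name for t in g] for g in overlap_groups(tables)]
```

After each group is compacted into one sstable, a read over any clustering range in this
toy model touches at most one sstable per group.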

The basic idea would be to select boundary clustering prefixes based on the current data occupancy
within those ranges, falling roughly along the boundaries of the existing sstables, but placed so
that any overlapping tail falls on one side or the other. We then compact all overlapping sstables
and split the results to one side or the other of the boundary (or across multiple boundaries).
If there are no historical updates, this gives close to optimal behaviour: we only compact files
until we reach our packing threshold (so that reads are known to be at the configured efficiency),
and then stop. If updates to older records appear, they would be compacted into their boundary
buckets and left there until a boundary had accumulated enough files (probably following normal
STCS rules) to warrant compaction.
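
A minimal sketch of the compact-and-split step, under simplifying assumptions: rows are
(clustering key, value) pairs with integer keys, sstables are passed oldest first so newer
values win, and the boundaries are already chosen. The names compact_and_split and
bucket_needs_compaction are hypothetical; the default of 4 files mirrors STCS's
min_threshold default and is only an assumption about how "normal STCS rules" might apply.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (clustering key, value); integer keys are a stand-in


def compact_and_split(sstables: List[List[Row]],
                      boundaries: List[int]) -> Dict[int, List[Row]]:
    """Merge overlapping sstables and split the result at the boundaries.

    Bucket i holds keys in [boundaries[i-1], boundaries[i]); bucket 0 holds
    everything below the first boundary. `boundaries` must be sorted.
    """
    merged: Dict[int, str] = {}
    for rows in sstables:  # assumed oldest first, so later writes win
        for ck, value in rows:
            merged[ck] = value
    buckets: Dict[int, List[Row]] = {}
    for ck in sorted(merged):
        index = bisect_right(boundaries, ck)  # which side of each boundary
        buckets.setdefault(index, []).append((ck, merged[ck]))
    return buckets


def bucket_needs_compaction(file_count: int, min_threshold: int = 4) -> bool:
    """Assumed STCS-like rule: recompact a bucket once it holds enough files."""
    return file_count >= min_threshold


old = [(1, "a"), (5, "b"), (12, "c")]
new = [(5, "b2"), (15, "d"), (22, "e")]
result = compact_and_split([old, new], boundaries=[10, 20])
```

Here the overlapping pair is rewritten as three non-overlapping outputs, one per bucket.
A later historical update landing in bucket 0 would simply add a file to that bucket until
bucket_needs_compaction reports that it is worth merging.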

The benefit is that, in contrast to DTCS, such historical updates are still accounted for and
bounded, and the configuration parameters give more tunable characteristics, with explicit
expectations (i.e. one seek per X bytes read in a partition; a higher X may imply more compaction,
a lower X more merges and seeks on read). It may also permit some easy optimisations further
up the stack, since we can guarantee the boundaries of overlap.
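
To make the "one seek per X bytes read" expectation concrete, here is a small arithmetic
sketch. The function name expected_seeks and the example sizes are illustrative
assumptions, not proposed configuration values.

```python
def expected_seeks(bytes_read: int, x_bytes: int) -> int:
    """Seeks expected when reading bytes_read at one seek per x_bytes."""
    # Integer ceiling division: a partial trailing chunk still costs a seek.
    return (bytes_read + x_bytes - 1) // x_bytes


KIB = 1024
read_size = 1024 * KIB  # a 1 MiB read within one partition

seeks_small_x = expected_seeks(read_size, 64 * KIB)   # lower X
seeks_large_x = expected_seeks(read_size, 256 * KIB)  # higher X
```

With these numbers, raising X from 64 KiB to 256 KiB cuts the expected seeks from 16 to 4,
at the cost of the extra compaction needed to keep data packed that densely.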

This message was sent by Atlassian JIRA