Date: Fri, 3 Oct 2014 18:34:39 +0000 (UTC)
From: "Jeremiah Jordan (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-6602) Compaction improvements to optimize time series data

    [ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158316#comment-14158316 ]

Jeremiah Jordan edited comment on CASSANDRA-6602 at 10/3/14 6:33 PM:
---------------------------------------------------------------------

bq. A side effect of this is that if min_compaction_threshold is, say, 4, you might have 1 big and 2 small SSTables from the last hour; the moment time (more specifically, the "now" variable) crosses over to a new hour, it will stay uncompacted like that until those SSTables enter a bigger target. At that point, what you would have expected to be 4 similarly sized SSTables in that new target would instead be 3 similarly sized ones, 1 a tad smaller, and 2 small ones (the same thing is likely to have happened during other hours too). But after that compaction happens, everything is as it should be. I don't believe that it's a big deal, and there are a couple of ways around it. The (now/timeUnit) - 1 initial target approach fixes this. Another way would be to ignore min_compaction_threshold for anything beyond how I use it as a "base", or to keep base and min_compaction_threshold separate, letting you set min_compaction_threshold to 2. Does anyone have any ideas about this?

I think using the min_compaction_threshold in the "current" area is good, so you basically do STCS until it's been long enough to "bucket" the sstable. Then once something is in a time bucket, treat it as min_compaction_threshold=2 so that each bucket only has one sstable in it.
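For illustration, here is a minimal sketch of that selection rule, using hypothetical stand-in types; this is not the actual DateTieredCompactionStrategy code, and real STCS size-tiering is collapsed into a plain count threshold for the "current" window:

{code:java}
import java.util.*;
import java.util.concurrent.TimeUnit;

public class TimeBucketSketch
{
    // Stand-in for a real sstable: only what the sketch needs.
    record SSTable(long minTimestampMillis, long sizeBytes) {}

    static final long BUCKET_MILLIS = TimeUnit.HOURS.toMillis(1); // assumed time unit
    static final int MIN_COMPACTION_THRESHOLD = 4;                // acts as the STCS "base"

    // Group sstables by the time bucket their oldest data falls into.
    static Map<Long, List<SSTable>> bucketize(List<SSTable> sstables)
    {
        Map<Long, List<SSTable>> buckets = new TreeMap<>();
        for (SSTable s : sstables)
            buckets.computeIfAbsent(s.minTimestampMillis() / BUCKET_MILLIS,
                                    k -> new ArrayList<>()).add(s);
        return buckets;
    }

    // Pick the next group of sstables to compact together.
    static List<SSTable> nextCompaction(List<SSTable> sstables, long nowMillis)
    {
        long currentBucket = nowMillis / BUCKET_MILLIS;
        for (Map.Entry<Long, List<SSTable>> e : bucketize(sstables).entrySet())
        {
            List<SSTable> group = e.getValue();
            if (e.getKey() == currentBucket)
            {
                // "Current" area: STCS-like, wait for the full threshold.
                if (group.size() >= MIN_COMPACTION_THRESHOLD)
                    return group;
            }
            else if (group.size() >= 2)
            {
                // Older bucket: effective min_compaction_threshold = 2,
                // so each time bucket converges to a single sstable.
                return group;
            }
        }
        return List.of(); // nothing to compact
    }
}
{code}

A real strategy would of course size-tier within the current window rather than just counting sstables; the count threshold above only stands in for that behaviour.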
> Compaction improvements to optimize time series data
> -----------------------------------------------------
>
>          Key: CASSANDRA-6602
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
>      Project: Cassandra
>   Issue Type: New Feature
>   Components: Core
>     Reporter: Tupshin Harper
>     Assignee: Björn Hegerfors
>       Labels: compaction, performance
>      Fix For: 3.0
>
>  Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt
>
>
> There are some unique characteristics of many/most time series use cases that both pose challenges and offer unique opportunities for optimization.
> One of the major challenges is compaction. The existing compaction strategies tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the CPU and I/O costs of that write.
> Compaction exists to:
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all three.
> Time series data:
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to its TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561, and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks how that compaction strategy is better than disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern.
> (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node.)
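A quick sanity check of those example numbers, as purely illustrative arithmetic:

{code:java}
public class SteadyStateSizing
{
    public static void main(String[] args)
    {
        double ingestGBPerDay = 5.0; // example ingestion rate from above
        int retentionDays = 1000;    // example TTL from above

        double steadyStateTB = ingestGBPerDay * retentionDays / 1000.0;
        long sstableCount = retentionDays; // one flush per day, per point 1 below

        System.out.printf("steady state: %.1f TB/node in %d sstables of %.1f GB each%n",
                          steadyStateTB, sstableCount, ingestGBPerDay);
        // prints: steady state: 5.0 TB/node in 1000 sstables of 5.0 GB each
    }
}
{code}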
> 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (An open question is whether predictable intervals are actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient.)
> 2) Combine this behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of their columns have expired. (Another side note: it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL, since it is required that all columns have the same TTL.)
> 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables.
> The result is that for in-order delivered data, every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will "only" be 1000 5GB sstables, so the number of filehandles will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to maintain more than one time-centric memtable in memory at a time (e.g. two 12 hour ones), and then always insert into whichever of the two "owns" the appropriate range of time. By delaying the flush of the older one until we are ready to roll writes over to a third one, we can avoid any fragmentation as long as all deliveries come in no more than 12 hours late (in this example; presumably tunable). A sketch of this routing scheme follows at the end of this message.
> Anything that triggers compactions will have to be looked at, since there won't be any. The one concern I have is the ramifications of repair. Initially, at least, I think it would be acceptable to just write one sstable per repair and not bother trying to merge it with other sstables.
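A minimal sketch of that two-memtable routing scheme, under the assumptions above (12-hour windows, deliveries at most one window late). Every type and method here is a hypothetical stand-in, not a Cassandra internal:

{code:java}
import java.util.concurrent.TimeUnit;

public class DualWindowMemtables
{
    static final long WINDOW_MILLIS = TimeUnit.HOURS.toMillis(12); // example window from above

    // Stand-in for a real memtable: only tracks which time window it owns.
    static class Memtable
    {
        final long windowStartMillis;
        Memtable(long windowStartMillis) { this.windowStartMillis = windowStartMillis; }
        void put(long timestampMillis, Object value) { /* buffer the write */ }
    }

    private Memtable behind; // owns the previous window, kept around for late writes
    private Memtable ahead;  // owns the current window

    DualWindowMemtables(long nowMillis)
    {
        long currentWindow = (nowMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
        behind = new Memtable(currentWindow - WINDOW_MILLIS);
        ahead = new Memtable(currentWindow);
    }

    void write(long timestampMillis, Object value, long nowMillis)
    {
        maybeRoll(nowMillis);
        // Route by the cell's own timestamp, not its arrival time.
        if (timestampMillis >= ahead.windowStartMillis)
            ahead.put(timestampMillis, value);
        else
            behind.put(timestampMillis, value); // late delivery, up to one window behind
    }

    // Assumes the clock advances less than one full window between writes.
    private void maybeRoll(long nowMillis)
    {
        if (nowMillis >= ahead.windowStartMillis + WINDOW_MILLIS)
        {
            flush(behind); // the older window is now closed; write it out
            behind = ahead;
            ahead = new Memtable(ahead.windowStartMillis + WINDOW_MILLIS);
        }
    }

    private void flush(Memtable m) { /* emit m as a single sstable */ }
}
{code}

Anything delivered more than one window late would still land in the wrong sstable, which is exactly the residual case the description above accepts.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)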