cassandra-commits mailing list archives

From Björn Hegerfors (JIRA) <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-6602) Compaction improvements to optimize time series data
Date Thu, 21 Aug 2014 21:28:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105970#comment-14105970 ]

Björn Hegerfors edited comment on CASSANDRA-6602 at 8/21/14 9:26 PM:
---------------------------------------------------------------------

Sorry about not getting back with results earlier. Here, I attach a tool that prints the
min/max timestamps of each input SSTable and calculates their overlaps.
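
For reference, the gist of the overlap calculation is roughly the following (a minimal,
self-contained sweep-line sketch with made-up inputs; the attached TimestampViewer.java reads
the real min/max values from each SSTable's metadata instead):

    import java.util.*;

    // Sweep-line sketch of the overlap calculation. The (min, max) pairs
    // below are hypothetical stand-ins for real SSTable metadata.
    public class OverlapSketch {
        public static void main(String[] args) {
            long[][] sstables = {            // {minTimestamp, maxTimestamp}, microseconds
                {1_400_000_000_000_000L, 1_400_600_000_000_000L},
                {1_400_590_000_000_000L, 1_401_200_000_000_000L},
                {1_401_190_000_000_000L, 1_401_800_000_000_000L},
            };
            // +1 where an sstable's interval opens, -1 where it closes.
            TreeMap<Long, Integer> delta = new TreeMap<>();
            for (long[] s : sstables) {
                delta.merge(s[0], 1, Integer::sum);
                delta.merge(s[1], -1, Integer::sum);
            }
            // Walk the endpoints in order, summing the length of timespan
            // covered by exactly 'depth' sstables at once.
            Map<Integer, Long> byDepth = new TreeMap<>();
            int depth = 0;
            long prev = delta.firstKey();
            for (Map.Entry<Long, Integer> e : delta.entrySet()) {
                if (depth > 0)
                    byDepth.merge(depth, e.getKey() - prev, Long::sum);
                depth += e.getValue();
                prev = e.getKey();
            }
            long total = delta.lastKey() - delta.firstKey();
            for (Map.Entry<Integer, Long> e : byDepth.entrySet())
                System.out.printf("covered by %d sstable(s): %.2f%%%n",
                                  e.getKey(), 100.0 * e.getValue() / total);
        }
    }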

Along with it is the output that TimestampViewer gave 1 week and about 8 weeks after the
write survey mode node was started with DTCS.

There were a few hiccups with this test, because not all timestamps are in microseconds. It
turned out that data in this particular cluster used to be written with microsecond timestamps,
but at some point it apparently started using milliseconds instead. For that reason, I had
to abort the first test and make a version of DTCS that converts any timestamp into microseconds
(making assumptions about which year it's running in, of course).
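
The conversion boils down to guessing the unit from a timestamp's magnitude. Roughly like
this (a sketch with assumed cutoffs, not the exact code from my modified DTCS):

    // Guess the unit from the magnitude and normalize to microseconds.
    // The cutoffs assume timestamps from roughly 2001 onwards; this is an
    // illustration, not the exact heuristic used in the modified DTCS.
    static long toMicroseconds(long timestamp) {
        if (timestamp > 1_000_000_000_000_000L)  // >= ~2001 in microseconds
            return timestamp;
        if (timestamp > 1_000_000_000_000L)      // >= ~2001 in milliseconds
            return timestamp * 1000L;
        return timestamp * 1_000_000L;           // otherwise assume seconds
    }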

That fixed the biggest problem, but the results are still somewhat affected by it. What has
happened, as the week 8 output makes clear, is that the biggest and oldest SSTable at this
point contains all of the microsecond timestamps and some of the millisecond timestamps. The
minimum timestamp, as that SSTable reports it, is one in milliseconds, but the file actually
contains much older data that was written in microseconds. The maximum timestamp, as it reports
it, is one in microseconds, but the file actually contains more recent data in milliseconds.
So that one file simply lies about its time interval. Any newer SSTable (May 24 and onwards
in the 8 week output) is unaffected by this!
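
To put rough numbers on it: a mid-2014 timestamp is about 1,400,000,000,000 in milliseconds
but about 1,400,000,000,000,000 in microseconds. Every millisecond timestamp is therefore
numerically smaller than every microsecond one, regardless of wall-clock order, so a mixed
file always reports a millisecond value as its minimum and a microsecond value as its maximum.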

The week 1 file seems to be from a point where this had not yet stabilized, so it may not
be of much value. Regardless, the week 8 output looks very good if you scroll down to the
bottom. The huge gap at the beginning is caused by the timestamp inconsistency, so what you
should really read out of this is that very nearly 100% of the whole timespan is "covered"
by only a single SSTable; overlaps are negligible.

Let me know what you think about this output. Should it compact more to keep the number of
files lower, for instance? That is partly adjustable via compaction options, of course, all
of which were left at their defaults in this case. Does the output give you a good idea of
what's going on? Ideally, I'd like to view a diagram of it.

EDIT: Oh, I forgot to demonstrate what TimestampViewer shows if you run STCS. The attached
file "STCS 16 hours.txt" just shows a simple non-production test that I ran on my laptop for
16 hours. It was a simple time series with 100 rows. The point is to show how badly the
min/max timestamps overlap: 30% of the timespan is "covered" by all 11 SSTables! If someone
wants to try TimestampViewer on a production cluster that has run STCS on a time series for
a long while, that would be useful to see. The output on LCS nodes doesn't look too useful.
Sadly, the cluster that I ran my production test on used LCS, whereas DTCS is much more
directly comparable to STCS.



> Compaction improvements to optimize time series data
> ----------------------------------------------------
>
>                 Key: CASSANDRA-6602
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Björn Hegerfors
>              Labels: compaction, performance
>             Fix For: 3.0
>
>         Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java,
cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt,
cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt
>
>
> There are some unique characteristics of many/most time series use cases that both pose
challenges and provide unique opportunities for optimization.
> One of the major challenges is in compaction. The existing compaction strategies will
tend to re-compact data on disk at least a few times over the lifespan of each data point,
greatly increasing the CPU and I/O costs of that write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is laid out
contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to its TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561, and this ticket potentially replaces or lowers the need
for it. In that ticket, jbellis reasonably asks how that compaction strategy is better than
disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could be extremely
efficient for time-series use cases that follow the above pattern.
> (For context, I'm thinking of an example use case involving lots of streams of time-series
data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an
eventual steady state of 5TB per node)
> 1) You have an extremely large memtable (preferably off heap, if/when doable) for the
table, and that memtable is sized to be able to hold a lengthy window of time. A typical period
might be one day. At the end of that period, you flush the contents of the memtable to an
sstable and move to the next one. This is basically identical to current behaviour, but with
thresholds adjusted so that you can ensure flushing at predictable intervals; a sketch of
this windowed flush follows after step 4. (An open question is whether predictable intervals
are actually necessary, or whether just waiting until the huge memtable is nearly full is
sufficient.)
> 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped
once all of their columns have expired; the expiry check is also part of the sketch after
step 4. (Another side note: it might be valuable to have a modified version of CASSANDRA-3974
that doesn't bother storing per-column TTLs, since all columns are required to have the same
TTL.)
> 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all data earlier
than a particular timestamp, resulting in immediate dropping of obsoleted sstables.
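
> To make steps 1 and 2 concrete, here is a minimal sketch of the idea (illustrative Java
only; the class, the window/TTL handling, and the assumption of mostly in-order delivery are
stand-ins for exposition, not Cassandra internals):

    import java.util.*;

    // Illustrative only: one memtable per fixed time window, flushed when
    // the window rolls over, plus the whole-sstable expiry check from
    // step 2. None of these names correspond to real Cassandra classes.
    class WindowedTimeSeriesStore {
        static final long WINDOW_MICROS = 24L * 60 * 60 * 1_000_000;    // one day
        static final long TTL_MICROS = 1000L * 24 * 60 * 60 * 1_000_000; // 1000 days

        private long currentWindow = Long.MIN_VALUE;
        private final List<long[]> memtable = new ArrayList<>();      // {timestamp, value}
        private final Map<Long, Long> sstableMaxTs = new HashMap<>(); // window -> max timestamp

        void write(long timestampMicros, long value) {
            long window = timestampMicros / WINDOW_MICROS;
            if (currentWindow != Long.MIN_VALUE && window != currentWindow)
                flush(); // window boundary crossed: one sstable per window
            currentWindow = window;
            memtable.add(new long[]{timestampMicros, value});
        }

        private void flush() {
            // In Cassandra this would write an sstable covering one window.
            long maxTs = memtable.stream().mapToLong(c -> c[0]).max().orElse(0L);
            sstableMaxTs.put(currentWindow, maxTs);
            memtable.clear();
        }

        // Step 2: with a single table-wide TTL, an sstable is fully expired
        // (droppable without compaction) once maxTimestamp + TTL has passed.
        List<Long> fullyExpired(long nowMicros) {
            List<Long> droppable = new ArrayList<>();
            for (Map.Entry<Long, Long> e : sstableMaxTs.entrySet())
                if (e.getValue() + TTL_MICROS < nowMicros)
                    droppable.add(e.getKey());
            return droppable;
        }
    }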
> The result is that for in-order delivered data, every cell will be laid out optimally
on disk on the first pass, and over the course of 1000 days and 5TB of data, there will "only"
be 1000 5GB sstables, so the number of file handles will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the extended (24
hour+) memtable flush times and merged correctly automatically. For those that were slightly
askew at flush time, or were delivered so far out of order that they go in the wrong sstable,
there is relatively low overhead to reading from two sstables for a time slice instead of
one, and that overhead would be incurred relatively rarely unless out-of-order delivery were
the common case, in which case this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to maintain more
than one time-centric memtable in memory at a time (e.g. two 12-hour ones), always inserting
into whichever of the two "owns" the appropriate range of time. By delaying the flush of the
earlier one until we are ready to roll writes over to a third one, we can avoid any
fragmentation as long as all deliveries come in no more than 12 hours late (12 hours in this
example; presumably tunable), as the sketch below illustrates.
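
> A sketch of that two-memtable variant, under the same disclaimers as above (purely
illustrative names, 12-hour windows assumed):

    import java.util.*;

    // Illustrative only: keep memtables for the two most recent 12-hour
    // windows, route each write to the window that owns its timestamp,
    // and flush the earlier window only when a third window opens.
    class DualWindowMemtables {
        static final long WINDOW_MICROS = 12L * 60 * 60 * 1_000_000;
        private final NavigableMap<Long, List<long[]>> windows = new TreeMap<>();

        void write(long timestampMicros, long value) {
            long window = timestampMicros / WINDOW_MICROS;
            windows.computeIfAbsent(window, w -> new ArrayList<>())
                   .add(new long[]{timestampMicros, value});
            // At most two windows stay in memory, so a write up to 12 hours
            // late still lands in its own window's memtable (no fragmentation).
            while (windows.size() > 2) {
                Map.Entry<Long, List<long[]>> oldest = windows.pollFirstEntry();
                System.out.printf("flushing window %d (%d cells)%n",
                                  oldest.getKey(), oldest.getValue().size());
            }
        }
    }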
> Anything that triggers compactions will have to be looked at, since there won't be any.
The one concern I have is the ramification of repair. Initially, at least, I think it would
be acceptable to just write one sstable per repair and not bother trying to merge it with
other sstables.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
