Date: Fri, 3 Oct 2014 18:34:39 +0000 (UTC)
From: "Jeremiah Jordan (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-6602) Compaction improvements to optimize time series data

    [ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158316#comment-14158316 ]

Jeremiah Jordan edited comment on CASSANDRA-6602 at 10/3/14 6:33 PM:
---------------------------------------------------------------------

bq. A side effect of this is that if min_compaction_threshold is, say, 4, you might have 1 big and 2 small SSTables from the last hour; the moment time (more specifically, the "now" variable) crosses over to a new hour, it will stay uncompacted like that until those SSTables enter a bigger target. At that point, what you would have expected to be 4 similarly sized SSTables in that new target would instead be 3 similarly sized ones, 1 a tad smaller, and 2 small ones (the same thing is likely to have happened during other hours too). But after that compaction happens, everything is as it should be. I don't believe that it's a big deal, and there are a couple of ways around it. The (now/timeUnit) - 1 initial target approach fixes this. Another way would be to ignore min_compaction_threshold for anything beyond how I use it as a "base", or to keep base and min_compaction_threshold separate, letting you set min_compaction_threshold to 2. Does anyone have any ideas about this?

I think using the min_compaction_threshold in the "current" area is good, so you basically do STCS until it's been long enough to "bucket" the sstable. Then once something is in a time bucket, treat it as min_compaction_threshold=2 so that each bucket only has one sstable in it.
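For illustration, here is a minimal sketch of that selection rule, using hypothetical stand-in types; this is not the actual DateTieredCompactionStrategy code, and real STCS size-tiering is collapsed into a plain count threshold for the "current" window:

{code:java}
import java.util.*;
import java.util.concurrent.TimeUnit;

public class TimeBucketSketch
{
    // Stand-in for a real sstable: only what the sketch needs.
    record SSTable(long minTimestampMillis, long sizeBytes) {}

    static final long BUCKET_MILLIS = TimeUnit.HOURS.toMillis(1); // assumed time unit
    static final int MIN_COMPACTION_THRESHOLD = 4;                // acts as the STCS "base"

    // Group sstables by the time bucket their oldest data falls into.
    static Map<Long, List<SSTable>> bucketize(List<SSTable> sstables)
    {
        Map<Long, List<SSTable>> buckets = new TreeMap<>();
        for (SSTable s : sstables)
            buckets.computeIfAbsent(s.minTimestampMillis() / BUCKET_MILLIS,
                                    k -> new ArrayList<>()).add(s);
        return buckets;
    }

    // Pick the next group of sstables to compact together.
    static List<SSTable> nextCompaction(List<SSTable> sstables, long nowMillis)
    {
        long currentBucket = nowMillis / BUCKET_MILLIS;
        for (Map.Entry<Long, List<SSTable>> e : bucketize(sstables).entrySet())
        {
            List<SSTable> group = e.getValue();
            if (e.getKey() == currentBucket)
            {
                // "Current" area: STCS-like, wait for the full threshold.
                if (group.size() >= MIN_COMPACTION_THRESHOLD)
                    return group;
            }
            else if (group.size() >= 2)
            {
                // Older bucket: effective min_compaction_threshold = 2,
                // so each time bucket converges to a single sstable.
                return group;
            }
        }
        return List.of(); // nothing to compact
    }
}
{code}

A real strategy would of course size-tier within the current window rather than just counting sstables; the count threshold above only stands in for that behaviour.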
> Compaction improvements to optimize time series data
> -----------------------------------------------------
>
>          Key: CASSANDRA-6602
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
>      Project: Cassandra
>   Issue Type: New Feature
>   Components: Core
>     Reporter: Tupshin Harper
>     Assignee: Björn Hegerfors
>       Labels: compaction, performance
>      Fix For: 3.0
>
>  Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt
>
>
> There are some unique characteristics of many/most time series use cases that both pose challenges and offer unique opportunities for optimization.
> One of the major challenges is compaction. The existing compaction strategies tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the CPU and I/O costs of that write.
> Compaction exists to:
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all three.
> Time series data:
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to its TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561, and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks how that compaction strategy is better than disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern.
> (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node.)
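A quick sanity check of those example numbers, as purely illustrative arithmetic:

{code:java}
public class SteadyStateSizing
{
    public static void main(String[] args)
    {
        double ingestGBPerDay = 5.0; // example ingestion rate from above
        int retentionDays = 1000;    // example TTL from above

        double steadyStateTB = ingestGBPerDay * retentionDays / 1000.0;
        long sstableCount = retentionDays; // one flush per day, per point 1 below

        System.out.printf("steady state: %.1f TB/node in %d sstables of %.1f GB each%n",
                          steadyStateTB, sstableCount, ingestGBPerDay);
        // prints: steady state: 5.0 TB/node in 1000 sstables of 5.0 GB each
    }
}
{code}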
> 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (An open question is whether predictable intervals are actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient.)
> 2) Combine this behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of their columns have expired. (Another side note: it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL, since it is required that all columns have the same TTL.)
> 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables.
> The result is that for in-order delivered data, every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will "only" be 1000 5GB sstables, so the number of filehandles will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to maintain more than one time-centric memtable in memory at a time (e.g. two 12 hour ones), and then always insert into whichever of the two "owns" the appropriate range of time. By delaying the flush of the older one until we are ready to roll writes over to a third one, we can avoid any fragmentation as long as all deliveries come in no more than 12 hours late (in this example; presumably tunable). A sketch of this routing scheme follows at the end of this message.
> Anything that triggers compactions will have to be looked at, since there won't be any. The one concern I have is the ramifications of repair. Initially, at least, I think it would be acceptable to just write one sstable per repair and not bother trying to merge it with other sstables.
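A minimal sketch of that two-memtable routing scheme, under the assumptions above (12-hour windows, deliveries at most one window late). Every type and method here is a hypothetical stand-in, not a Cassandra internal:

{code:java}
import java.util.concurrent.TimeUnit;

public class DualWindowMemtables
{
    static final long WINDOW_MILLIS = TimeUnit.HOURS.toMillis(12); // example window from above

    // Stand-in for a real memtable: only tracks which time window it owns.
    static class Memtable
    {
        final long windowStartMillis;
        Memtable(long windowStartMillis) { this.windowStartMillis = windowStartMillis; }
        void put(long timestampMillis, Object value) { /* buffer the write */ }
    }

    private Memtable behind; // owns the previous window, kept around for late writes
    private Memtable ahead;  // owns the current window

    DualWindowMemtables(long nowMillis)
    {
        long currentWindow = (nowMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
        behind = new Memtable(currentWindow - WINDOW_MILLIS);
        ahead = new Memtable(currentWindow);
    }

    void write(long timestampMillis, Object value, long nowMillis)
    {
        maybeRoll(nowMillis);
        // Route by the cell's own timestamp, not its arrival time.
        if (timestampMillis >= ahead.windowStartMillis)
            ahead.put(timestampMillis, value);
        else
            behind.put(timestampMillis, value); // late delivery, up to one window behind
    }

    // Assumes the clock advances less than one full window between writes.
    private void maybeRoll(long nowMillis)
    {
        if (nowMillis >= ahead.windowStartMillis + WINDOW_MILLIS)
        {
            flush(behind); // the older window is now closed; write it out
            behind = ahead;
            ahead = new Memtable(ahead.windowStartMillis + WINDOW_MILLIS);
        }
    }

    private void flush(Memtable m) { /* emit m as a single sstable */ }
}
{code}

Anything delivered more than one window late would still land in the wrong sstable, which is exactly the residual case the description above accepts.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)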