Date: Fri, 3 Oct 2014 16:30:37 +0000 (UTC)
From: Björn Hegerfors (JIRA)
To: commits@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-6602) Compaction improvements to optimize time series data

    [ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158131#comment-14158131 ]

Björn Hegerfors edited comment on CASSANDRA-6602 at 10/3/14 4:29 PM:
---------------------------------------------------------------------

OK, so a writeup on some implementation details of this strategy might be in order. Maybe there's a better place to put something as long as this.

But first, thanks for the patch Marcus! Everything looks good.

Now, Marcus mentioned a worry about sliding windows as far as SSTable candidate selection goes. Or at least that's what it sounded like. That was indeed one of my main concerns when I started implementing DTCS. After all, time is constantly moving, so how do we avoid the situation where some SSTables compact and then immediately slide over into some "SSTables older than x" time window and need to compact again? Or maybe, before the first compaction, a few of the oldest of those SSTables have already slipped over, causing the window for bigger SSTables to reach more than min_compaction_threshold and triggering a compaction with mixed-size SSTables. Or something else like that.

It can easily get messy, so I aimed for an algorithm with as little of a "sliding windows" effect as possible. What I came up with is really all summed up by the Target class. A target represents a time span between the Unix epoch* and now. An SSTable is "on target" if its min timestamp is within a given target's time span.

This is the first decision that I would like some input on. My thinking was that if there's a big SSTable with timestamps ranging from old to new, picking the max timestamp would have made this data participate in too frequent compactions, because SSTables considered new compact more often than those considered old. STCS creates these kinds of SSTables; if you run DTCS from the start, it won't matter which timestamp you choose as the representative timestamp of an SSTable. So choosing the min timestamp was for improved compatibility with STCS, essentially.
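To make that concrete, here is a minimal sketch of what such a target and its "on target" check could look like. The field names and constructor are my own assumptions for illustration, not the actual code in the patch:

    // Sketch only (hypothetical names): a target covers the half-open
    // timestamp span [start, start + size).
    public class Target {
        final long start; // span start, inclusive, in the timestamps' resolution
        final long size;  // span length, a multiple of the configured time unit

        Target(long start, long size) {
            this.start = start;
            this.size = size;
        }

        // "On target": the SSTable's *min* timestamp falls inside the span.
        boolean onTarget(long sstableMinTimestamp) {
            return sstableMinTimestamp >= start && sstableMinTimestamp < start + size;
        }
    }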
How these targets are chosen is the most interesting part. I tend to think of targets backwards in time. The first target is the one covering the most recent timestamps. The most naive idea would be to let the first target be the last hour, the next target the hour before, and so on for 24 hours. Then target #25 would be the day from 24 hours ago to 48 hours ago. 6 of those would get us a week back. Then the same thing for months and years, etc. This approach has a number of issues. First, it needs a concept of a calendar that I would rather not hard-code, as not everybody will want it that way. Plus, it would compact by widely varying factors like 24, 7 and 4 (but a month is not exactly 4 weeks, except February, sometimes... ugh!). So I quickly ditched the calendar for a configurable factor, called "base", and an initial target size that I call "time_unit". You could set base to 3 and timeUnit to an hour, for example, expressed in microseconds (assuming microseconds is what your timestamps use; I recommend sticking with microseconds, which appears to be the CQL standard. Timestamp inconsistencies mess with DTCS, as I've involuntarily experienced in my tests). Then the calendar-less version of the above would pick the last hour as the initial target and 2 more for the hours before that. Then before that, 2 3-hour targets, then 2 9-hour targets, etc. I then noticed that "base" mostly duplicated the role of min_compaction_threshold, so I removed it in favor of that. Any input on that choice?

The above solution is similar to my final algorithm, but if you noticed my wording, you'll see that it still suffers from the "sliding windows" issue. It's most probably a bad idea to use something like "last hour" or "between 18 hours ago and 9 hours ago" as targets, since those would be moving targets. Some integer arithmetic fits perfectly as a way around this. Let "now" be the current time (I get it by taking the greatest max timestamp across all SSTables; any input on that?). Let's keep timeUnit at an hour and base (min_compaction_threshold) at 3. Using integer division, now/timeUnit will yield the same value for a whole hour. Now we can use that as our first target. Unless we recently passed an hour threshold, (now - 20 seconds)/timeUnit will have the same result. We can't simply use the old approach of making 2 single-hour targets before that, etc., because that would just lead to the same sliding windows, only with considerably lower resolution (certainly better than before, though). Rather, consider what happens if you integer divide this initial target by base. We get a new target representing the 3-hour time span, perfectly aligned on 3-hour borders, that contains the current hour. The current hour could be the 1st, 2nd or 3rd one of those hours (nothing in between). You can continue like that, dividing by base again to get the perfectly aligned 9-hour time span containing the aforementioned 3-hour span in one of three slots.

This could be the sequence of targets to use. I chose not to use this, and this is again somewhere I could use a second opinion. Instead, I let the initial target t0 = now/timeUnit (integer division is implicit), just like before, but the next target is t1 = (t0/base) - 1. That is, the perfectly aligned 3-hour time span right before the one that t0 was in. Then t2 = (t1/base) - 1 and so on. With this change, new SSTables aren't candidates for all targets after the first one they "hit", as they would have been without this last change.
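As a small illustration of the arithmetic (my own sketch; the names are hypothetical, and "now" is taken from the wall clock here, whereas DTCS derives it from the greatest max timestamp across all SSTables):

    // Illustrative sketch of the integer-division target indices.
    public class TargetArithmetic {
        public static void main(String[] args) {
            long timeUnit = 3_600_000_000L;                // one hour, in microseconds
            long base = 3;                                 // min_compaction_threshold doubling as "base"
            long now = System.currentTimeMillis() * 1000L; // wall-clock stand-in for the max timestamp

            long t0 = now / timeUnit; // same value for the whole current hour
            long t1 = t0 / base - 1;  // aligned 3-hour span just before t0's span
            long t2 = t1 / base - 1;  // aligned 9-hour span before that, and so on
            System.out.println(t0 + " " + t1 + " " + t2);
        }
    }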
Now we're almost there. The first odd thing about this sequence of targets is that there are now gaps between targets. If t0 is not the first hour of a perfectly aligned 3-hour time span (in integer arithmetic terms: if (t0 % base > 0)), then the next target t1 (a 3-hour span) will not end right before t0. I thought it made sense to fill in those gaps. So in those cases, when (tn % base > 0), I did it in the simplest way, by letting the next target t(n+1) have the same size as the current target, but moved one step earlier. Essentially t(n+1) = tn - 1.

This is exactly the way DTCS is implemented right now. Whenever at least min_compaction_threshold SSTables hit the same target, they get submitted for compaction. If multiple targets succeed, the first (i.e. latest) one is picked. In the first patch that I submitted here, one thing was done differently: the initial target was (now/timeUnit) - 1, meaning that the current hour was not included in any target. I'm no expert at which of these is best, but the difference is that the approach of that patch let new SSTables queue up until we passed an hour (or whatever timeUnit is set to) border; then they would all compact at once. The current implementation favors compacting and recompacting them as much as possible for the first hour (but not if they're fewer than min_compaction_threshold).

A side effect of this is that if min_compaction_threshold is, say, 4, you might have 1 big and 2 small SSTables from the last hour; the moment time (more specifically, the "now" variable) crosses over to a new hour, it will stay uncompacted like that until those SSTables enter a bigger target. At that point, what you would have expected to be 4 similarly sized SSTables in that new target would rather be 3 similarly sized ones, 1 a tad smaller, and 2 small ones (actually the same thing is likely to have happened during other hours too). But after that compaction happens, everything is as it should be**. I don't believe that it's a big deal. There are a couple of ways around it. The (now/timeUnit) - 1 initial target approach fixes this. Another way would be to ignore min_compaction_threshold for anything beyond how I use it as a "base", or to keep base and min_compaction_threshold separate, letting you set min_compaction_threshold to 2. Does anyone have any ideas about this?

From what I remember right now, the last tweak that I've been thinking about is the following. Let's use our base=3, timeUnit=hour example from before. If we have (t0 % base == 0), we know that we're at a 3-hour border. Right now that means that the next target is t1 = (t0/base) - 1. But what if we're actually also at a 9-hour border? Then we could make the next target 9 times bigger and put it at (t0/(base^2)) - 1 instead. But we might even be at a 27-hour border. I hope you see where this is going? Using this knowledge, we could more aggressively compact SSTables up to bigger sizes at an earlier point. That tweak might be worth considering.

I'll end this detailed explanation with a simple visualization that might help in getting some intuition about how the targets "move". Here, base is 4 and we let time go from time 0 to 32. On each line, one timeUnit has passed; what you see is a list of targets and which timestamps they cover (after division by timeUnit).
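As a cross-check, the "Current implementation" rows below can be reproduced with a small stand-alone simulation of the rules just described. This is my own sketch, not the patch code, with timeUnit fixed to 1 so that target indices equal timestamps:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Simulates the current target sequence: t0 = now, then either
    // t(n+1) = tn - 1 (gap fill) or t(n+1) = tn/base - 1 (aligned jump).
    public class TargetSim {
        public static void main(String[] args) {
            long base = 4;
            for (long now = 0; now <= 32; now++) {
                Deque<String> targets = new ArrayDeque<>();
                long index = now; // t0 = now / timeUnit, with timeUnit = 1 here
                long size = 1;    // t0 spans a single timeUnit
                while (index >= 0) {
                    targets.addFirst(span(index, size)); // collect oldest-first
                    if (index % base > 0) {
                        index -= 1;               // gap fill: same size, one step earlier
                    } else {
                        index = index / base - 1; // aligned span just before, base times bigger
                        size *= base;
                    }
                }
                System.out.println("[" + String.join(",", targets) + "]");
            }
        }

        // Expands target (index, size) to the timestamps it covers,
        // i.e. the half-open range [index*size, (index+1)*size).
        private static String span(long index, long size) {
            StringBuilder sb = new StringBuilder("[");
            for (long t = index * size; t < (index + 1) * size; t++) {
                if (t > index * size) sb.append(',');
                sb.append(t);
            }
            return sb.append(']').toString();
        }
    }

For the "more aggressive tweak" variant, the else branch would instead keep dividing index by base (multiplying size accordingly) for as long as index is positive and divisible by base, and only then step back by one.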
Sliding windows approach:

[[0]]
[[0],[1]]
[[0],[1],[2]]
[[0],[1],[2],[3]]
[[0],[1],[2],[3],[4]]
[[0,1],[2],[3],[4],[5]]
[[0,1,2],[3],[4],[5],[6]]
[[0,1,2,3],[4],[5],[6],[7]]
[[0],[1,2,3,4],[5],[6],[7],[8]]
[[0,1],[2,3,4,5],[6],[7],[8],[9]]
[[0,1,2],[3,4,5,6],[7],[8],[9],[10]]
[[0,1,2,3],[4,5,6,7],[8],[9],[10],[11]]
[[0],[1,2,3,4],[5,6,7,8],[9],[10],[11],[12]]
[[0,1],[2,3,4,5],[6,7,8,9],[10],[11],[12],[13]]
[[0,1,2],[3,4,5,6],[7,8,9,10],[11],[12],[13],[14]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12],[13],[14],[15]]
[[0],[1,2,3,4],[5,6,7,8],[9,10,11,12],[13],[14],[15],[16]]
[[0,1],[2,3,4,5],[6,7,8,9],[10,11,12,13],[14],[15],[16],[17]]
[[0,1,2],[3,4,5,6],[7,8,9,10],[11,12,13,14],[15],[16],[17],[18]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12,13,14,15],[16],[17],[18],[19]]
[[0,1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16],[17],[18],[19],[20]]
[[0,1,2,3,4,5],[6,7,8,9],[10,11,12,13],[14,15,16,17],[18],[19],[20],[21]]
[[0,1,2,3,4,5,6],[7,8,9,10],[11,12,13,14],[15,16,17,18],[19],[20],[21],[22]]
[[0,1,2,3,4,5,6,7],[8,9,10,11],[12,13,14,15],[16,17,18,19],[20],[21],[22],[23]]
[[0,1,2,3,4,5,6,7,8],[9,10,11,12],[13,14,15,16],[17,18,19,20],[21],[22],[23],[24]]
[[0,1,2,3,4,5,6,7,8,9],[10,11,12,13],[14,15,16,17],[18,19,20,21],[22],[23],[24],[25]]
[[0,1,2,3,4,5,6,7,8,9,10],[11,12,13,14],[15,16,17,18],[19,20,21,22],[23],[24],[25],[26]]
[[0,1,2,3,4,5,6,7,8,9,10,11],[12,13,14,15],[16,17,18,19],[20,21,22,23],[24],[25],[26],[27]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12],[13,14,15,16],[17,18,19,20],[21,22,23,24],[25],[26],[27],[28]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13],[14,15,16,17],[18,19,20,21],[22,23,24,25],[26],[27],[28],[29]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14],[15,16,17,18],[19,20,21,22],[23,24,25,26],[27],[28],[29],[30]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28],[29],[30],[31]]
[[0],[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],[17,18,19,20],[21,22,23,24],[25,26,27,28],[29],[30],[31],[32]]

Current implementation:

[[0]]
[[0],[1]]
[[0],[1],[2]]
[[0],[1],[2],[3]]
[[0,1,2,3],[4]]
[[0,1,2,3],[4],[5]]
[[0,1,2,3],[4],[5],[6]]
[[0,1,2,3],[4],[5],[6],[7]]
[[0,1,2,3],[4,5,6,7],[8]]
[[0,1,2,3],[4,5,6,7],[8],[9]]
[[0,1,2,3],[4,5,6,7],[8],[9],[10]]
[[0,1,2,3],[4,5,6,7],[8],[9],[10],[11]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12],[13]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12],[13],[14]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12],[13],[14],[15]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12,13,14,15],[16]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12,13,14,15],[16],[17]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12,13,14,15],[16],[17],[18]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12,13,14,15],[16],[17],[18],[19]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20],[21]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20],[21],[22]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20],[21],[22],[23]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24],[25]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24],[25],[26]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24],[25],[26],[27]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28],[29]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28],[29],[30]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28],[29],[30],[31]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28,29,30,31],[32]]

More aggressive tweak:

[[0]]
[[0],[1]]
[[0],[1],[2]]
[[0],[1],[2],[3]]
[[0,1,2,3],[4]]
[[0,1,2,3],[4],[5]]
[[0,1,2,3],[4],[5],[6]]
[[0,1,2,3],[4],[5],[6],[7]]
[[0,1,2,3],[4,5,6,7],[8]]
[[0,1,2,3],[4,5,6,7],[8],[9]]
[[0,1,2,3],[4,5,6,7],[8],[9],[10]]
[[0,1,2,3],[4,5,6,7],[8],[9],[10],[11]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12],[13]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12],[13],[14]]
[[0,1,2,3],[4,5,6,7],[8,9,10,11],[12],[13],[14],[15]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16],[17]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16],[17],[18]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16],[17],[18],[19]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20],[21]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20],[21],[22]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20],[21],[22],[23]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24],[25]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24],[25],[26]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24],[25],[26],[27]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28],[29]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28],[29],[30]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19],[20,21,22,23],[24,25,26,27],[28],[29],[30],[31]]
[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],[16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31],[32]]

Hope that enlightens more than it blinds! It would be great to hear some feedback on this. More might be said about the effects of using DTCS, what it's good for, etc. But that's another topic.

- * A possible extension to DTCS is to add an option (default = 0L) for a time that should be considered time 0, rather than the Unix epoch.
- ** Slight reservation: max_compaction_threshold should be at least min_compaction_threshold^2, to be absolutely sure.
> Compaction improvements to optimize time series data
> ----------------------------------------------------
>
>         Key: CASSANDRA-6602
>         URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
>     Project: Cassandra
>  Issue Type: New Feature
>  Components: Core
>    Reporter: Tupshin Harper
>    Assignee: Björn Hegerfors
>      Labels: compaction, performance
>     Fix For: 3.0
>
> Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt
>
>
> There are some unique characteristics of many/most time series use cases that both provide challenges and provide unique opportunities for optimizations.
> One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the CPU and IO costs of that write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561, and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks how that compaction strategy is better than disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern.
> (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node.)
> 1) You have an extremely large memtable (preferably off-heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (An open question is whether predictable intervals are actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient.)
> 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have expired. (Another side note: it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL, since it is required that all columns have the same TTL.)
> 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables.
> The result is that for in-order delivered data, every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will "only" be 1000 5GB sstables, so the number of filehandles will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to maintain more than one time-centric memtable in memory at a time (e.g. two 12-hour ones), and then you always insert into whichever of the two "owns" the appropriate range of time. By delaying flushing the ahead one until we are ready to roll writes over to a third one, we are able to avoid any fragmentation as long as all deliveries come in no more than 12 hours late (in this example, presumably tunable).
> Anything that triggers compactions will have to be looked at, since there won't be any. The one concern I have is the ramification of repair. Initially, at least, I think it would be acceptable to just write one sstable per repair and not bother trying to merge it with other sstables.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)