Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 174DF200A5B for ; Wed, 25 May 2016 10:57:15 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 161D6160A17; Wed, 25 May 2016 08:57:15 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 38ABA160A2E for ; Wed, 25 May 2016 10:57:14 +0200 (CEST) Received: (qmail 48733 invoked by uid 500); 25 May 2016 08:57:13 -0000 Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@cassandra.apache.org Delivered-To: mailing list commits@cassandra.apache.org Received: (qmail 48674 invoked by uid 99); 25 May 2016 08:57:13 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 25 May 2016 08:57:13 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 0FC792C1F6C for ; Wed, 25 May 2016 08:57:13 +0000 (UTC) Date: Wed, 25 May 2016 08:57:13 +0000 (UTC) From: "Sylvain Lebresne (JIRA)" To: commits@cassandra.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time intervals MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 25 May 2016 08:57:15 -0000 [ https://issues.apache.org/jira/browse/CASSANDRA-11871?page=3Dcom.atla= ssian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId= =3D15299700#comment-15299700 ]=20 Sylvain Lebresne commented on CASSANDRA-11871: ---------------------------------------------- I would truly hate to limit this to a single time-related function, so I *r= eally* think we should re-phrase this ticket as adding support for function= s (as generally as possible, see below) in {{GROUP BY}}. In practice, there is obviously limitations to what functions we can suppor= t in {{GROUP BY}}. Informally, we need the {{GROUP BY}} to "respect" the cl= ustering order even when functions are applied. And I believe that for that= , we can use the following sufficient and reasonably simple to check condit= ions: # We only allow functions on a single column. That is, {{GROUP BY a, f(b)}}= might be ok (if the other conditions hold), but {{GROUP BY f(a, b)}} won= =E2=80=99t be. # If you ignore the function calls, the {{GROUP BY}} should still be valid = (in the sense of CASSANDRA-10707). That is, {{GROUP BY f(a), g(b)}} would o= nly be valid if {{GROUP BY a, b}} is. # We only allow functions in {{GROUP BY}} that are monotonic. I'll note that those conditions are _sufficient_ (if I'm not missing someth= ing) but they may not be _necessary_ (in other words, there is cases that w= ill be rejected by those conditions but coudl still "respect" the cluster o= rder). This is ok though because those rules are general enough to be usefu= l imo. The main point is that they are simple to check. On the 3rd point though, validating that a function is monotic automaticall= y is not at all easy and I'm not suggesting we do so. What I am suggesting = is that we simply manually mark functions that are know to be monotonic (ve= ry concretely, we'll add a {{isMonotonic}} method to the {{Function}} class= ). And initially, we'll only mark "native" functions as such, UDF will also= have this method return {{false}}, though we could in the future allow use= rs to declare their UDF as monotonic if we so wish. Of course, I fully agree that we also want to add a {{floor}}-like function= as suggested above. bq. An important aspect to keep in mind with a function like floor is the s= tarting point. I'm not exactly sure how that would work to be honest. First because, if I = understand what you're saying, it means the result of the floor method depe= nds on what is in the {{WHERE}} clause, and not only does that sound like a= mess to implement, but how would that work if I do: {noformat} SELECT floor(time, 3h), count() FROM foo GROUP BY floor(time, 3h); {noformat} When is my "starting point" then? Unless you mean that the {{floor}} functi= on has some per-partition state and the starting point is based on the firs= t value it receives, but that's even worth because that would mean the retu= rning intervals are not predictable. In fact, even if have a {{WHERE}}, I could have (assuming CASSANDRA-11873): {noformat} SELECT floor(time, 3h), count() FROM foo WHERE time > now() - 10h GROUP BY = floor(time, 3h); {noformat} in which case I probably don't want my intervals to be based on the value o= f {{now()}}. Am I missing something obvious? Anyway, I really think the {{floor}} function should be a normal function, = that only depends on its arguments. I do like the flexibility of being able= to define precise intervals, and I suppose we should provide the starting = point as argument of the method. Meaning that general form of the function = would be {{floor(value, range, starting_point)}} and would return the bigge= st time {{t}} such that {{t < value}} and such that {{t =3D starting_point = + n * range}} for some integer {{n}}. Assuming we can provide that, we coul= d just also have {{floor(value, range)}} as just a shortcut using a hard-co= ded {{starting_point}} (some Jan 1 at 0:00). But that said, I'm not sure su= ch method can be computed efficiently, so not sure how viable that is. > Allow to aggregate by time intervals > ------------------------------------ > > Key: CASSANDRA-11871 > URL: https://issues.apache.org/jira/browse/CASSANDRA-1187= 1 > Project: Cassandra > Issue Type: Improvement > Components: CQL > Reporter: Benjamin Lerer > Assignee: Benjamin Lerer > Fix For: 3.x > > > For time series data it can be usefull to aggregate by time intervals. > The idea would be to add support for one or several functions in the {{GR= OUP BY}} clause. > Regarding the implementation, even if in general I also prefer to follow = the SQL syntax, I do not believe it will be a good fit for Cassandra. > If we have a table like: > {code} > CREATE TABLE trades > { > symbol text, > date date, > time time, > priceMantissa int, > priceExponent tinyint, > volume int, > PRIMARY KEY ((symbol, date), time) > }; > {code} > The trades will be inserted with an increasing time and sorted in the sam= e order. As we can have to process a large amount of data, we want to try t= o limit ourself to the cases where we can build the groups on the flight (w= hich is not a requirement in the SQL world). > If we want to get the number of trades per minutes with the SQL syntax we= will have to write: > {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol =3D '= AAPL' AND date =3D '2016-01-11' GROUP BY hour(time), minute(time);}} > which is fine. The problem is that if the user invert by mistake the func= tions like that: > {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol =3D '= AAPL' AND date =3D '2016-01-11' GROUP BY minute(time), hour(time);}} > the query will return weird results. > The only way to prevent that would be to check the function order and mak= e sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), = second(time)}}). > In my opinion a function like {{floor(,