Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Wed, 25 May 2016 08:57:13 +0000 (UTC)
From: "Sylvain Lebresne (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.12971992.1463993498000.286111.1464166633061@Atlassian.JIRA>
In-Reply-To: <JIRA.12971992.1463993498000@Atlassian.JIRA>
References: <JIRA.12971992.1463993498000@Atlassian.JIRA> <JIRA.12971992.1463993498311@arcas>
Subject: [jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time
 intervals
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Wed, 25 May 2016 08:57:15 -0000


    [ https://issues.apache.org/jira/browse/CASSANDRA-11871?page=3Dcom.atla=
ssian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=
=3D15299700#comment-15299700 ]=20

Sylvain Lebresne commented on CASSANDRA-11871:
----------------------------------------------

I would truly hate to limit this to a single time-related function, so I *r=
eally* think we should re-phrase this ticket as adding support for function=
s (as generally as possible, see below) in {{GROUP BY}}.

In practice, there is obviously limitations to what functions we can suppor=
t in {{GROUP BY}}. Informally, we need the {{GROUP BY}} to "respect" the cl=
ustering order even when functions are applied. And I believe that for that=
, we can use the following sufficient and reasonably simple to check condit=
ions:
# We only allow functions on a single column. That is, {{GROUP BY a, f(b)}}=
 might be ok (if the other conditions hold), but {{GROUP BY f(a, b)}} won=
=E2=80=99t be.
# If you ignore the function calls, the {{GROUP BY}} should still be valid =
(in the sense of CASSANDRA-10707). That is, {{GROUP BY f(a), g(b)}} would o=
nly be valid if {{GROUP BY a, b}} is.
# We only allow functions in {{GROUP BY}} that are monotonic.

I'll note that those conditions are _sufficient_ (if I'm not missing someth=
ing) but they may not be _necessary_ (in other words, there is cases that w=
ill be rejected by those conditions but coudl still "respect" the cluster o=
rder). This is ok though because those rules are general enough to be usefu=
l imo. The main point is that they are simple to check.

On the 3rd point though, validating that a function is monotic automaticall=
y is not at all easy and I'm not suggesting we do so. What I am suggesting =
is that we simply manually mark functions that are know to be monotonic (ve=
ry concretely, we'll add a {{isMonotonic}} method to the {{Function}} class=
). And initially, we'll only mark "native" functions as such, UDF will also=
 have this method return {{false}}, though we could in the future allow use=
rs to declare their UDF as monotonic if we so wish.

Of course, I fully agree that we also want to add a {{floor}}-like function=
 as suggested above.

bq. An important aspect to keep in mind with a function like floor is the s=
tarting point.

I'm not exactly sure how that would work to be honest. First because, if I =
understand what you're saying, it means the result of the floor method depe=
nds on what is in the {{WHERE}} clause, and not only does that sound like a=
 mess to implement, but how would that work if I do:
{noformat}
SELECT floor(time, 3h), count() FROM foo GROUP BY floor(time, 3h);
{noformat}
When is my "starting point" then? Unless you mean that the {{floor}} functi=
on has some per-partition state and the starting point is based on the firs=
t value it receives, but that's even worth because that would mean the retu=
rning intervals are not predictable.

In fact, even if have a {{WHERE}}, I could have (assuming CASSANDRA-11873):
{noformat}
SELECT floor(time, 3h), count() FROM foo WHERE time > now() - 10h GROUP BY =
floor(time, 3h);
{noformat}
in which case I probably don't want my intervals to be based on the value o=
f {{now()}}.

Am I missing something obvious?

Anyway, I really think the {{floor}} function should be a normal function, =
that only depends on its arguments. I do like the flexibility of being able=
 to define precise intervals, and I suppose we should provide the starting =
point as argument of the method. Meaning that general form of the function =
would be {{floor(value, range, starting_point)}} and would return the bigge=
st time {{t}} such that {{t < value}} and such that {{t =3D starting_point =
+ n * range}} for some integer {{n}}. Assuming we can provide that, we coul=
d just also have {{floor(value, range)}} as just a shortcut using a hard-co=
ded {{starting_point}} (some Jan 1 at 0:00). But that said, I'm not sure su=
ch method can be computed efficiently, so not sure how viable that is.


> Allow to aggregate by time intervals
> ------------------------------------
>
>                 Key: CASSANDRA-11871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1187=
1
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GR=
OUP BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow =
the SQL syntax, I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
>     symbol text,
>     date date,
>     time time,
>     priceMantissa int,
>     priceExponent tinyint,
>     volume int,
>     PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the sam=
e order. As we can have to process a large amount of data, we want to try t=
o limit ourself to the cases where we can build the groups on the flight (w=
hich is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we=
 will have to write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol =3D '=
AAPL' AND date =3D '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the func=
tions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol =3D '=
AAPL' AND date =3D '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and mak=
e sure that we do not allow to skip functions (e.g. {{GROUP BY hour(time), =
second(time)}}).
> In my opinion a function like {{floor(<columnName>, <time range>)}} will =
be much better as it does not allow for this type of mistakes and is much m=
ore flexible (you can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol =3D 'AAPL' AND =
date =3D '2016-01-11' GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the=
 starting point. For a query like:  {{SELECT floor(time, m), count() FROM T=
rades WHERE symbol =3D 'AAPL' AND date =3D '2016-01-11' AND time >=3D '01:3=
0:00' AND time =3D< '07:30:00' GROUP BY floor(time, 2h);}}, I think that id=
eally the result should return 3 groups: {{01:30:00}}, {{03:30:00}} and {{0=
5:30:00}}. =20
> =20


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)