cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Pavel Trukhanov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-11871) Allow to aggregate by time intervals
Date Thu, 02 Feb 2017 14:39:51 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15849985#comment-15849985
] 

Pavel Trukhanov edited comment on CASSANDRA-11871 at 2/2/17 2:38 PM:
---------------------------------------------------------------------

Do I understand correctly that   {{SELECT minute(time), count() FROM Trades .. GROUP BY minute(time)}}
shouldn't be allowed because {{minute}} is not monotonic, while {{floor(dt, 1m)}} is? 
So is there any other functions other than {{floor}} and probably {{ceil}} that qualify to
be allowed in group by?


Also I think the right way to get only full "buckets" (groups) and not weird first and last
ones for {{SELECT count() FROM foo WHERE time > now() - 10h GROUP BY floor(time, 3h)}}
would be {{SELECT count() FROM foo WHERE time > floor(now() - 10h, 3h) GROUP BY floor(time,
3h)}} which looks ugly but right.

So this {{floor}} fn is more like date_trunc from Postgres
{{date_trunc(text, timestamp)}} — Truncate to specified precision {{date_trunc('hour', timestamp
'2001-02-16 20:38:40') -> '2001-02-16 20:00:00'}}. 

And I suggest that there should be plain {{trunc}} fn for integers etc. 




was (Author: pavel.trukhanov):
Do I understand correctly that   `SELECT minute(time), count() FROM Trades .. GROUP BY minute(time);`
shouldn't be allowed because `minute` is not monotonic, while `floor(dt, 1m)` is? 
So is there any other functions other than `floor` and `ceil` that qualify to be allowed in
group by?


Also I think the right way to get only full "buckets" (groups) and not weird first and last
ones for `SELECT count() FROM foo WHERE time > now() - 10h GROUP BY floor(time, 3h);` would
be `SELECT count() FROM foo WHERE time > floor(now() - 10h, 3h) GROUP BY floor(time, 3h);`
which looks ugly but right;

So this `floor` thing is more like date_trunc from Postgres
`date_trunc(text, timestamp)` — Truncate to specified precision `date_trunc('hour', timestamp
'2001-02-16 20:38:40') -> '2001-02-16 20:00:00'`. 



> Allow to aggregate by time intervals
> ------------------------------------
>
>                 Key: CASSANDRA-11871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the SQL syntax,
I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
>     symbol text,
>     date date,
>     time time,
>     priceMantissa int,
>     priceExponent tinyint,
>     volume int,
>     PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same order. As
we can have to process a large amount of data, we want to try to limit ourself to the cases
where we can build the groups on the flight (which is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we will have to
write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' AND date
= '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' AND date
= '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make sure that
we do not allow to skip functions (e.g. {{GROUP BY hour(time), second(time)}}).
> In my opinion a function like {{floor(<columnName>, <time range>)}} will
be much better as it does not allow for this type of mistakes and is much more flexible (you
can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the starting point.
For a query like:  {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND
date = '2016-01-11' AND time >= '01:30:00' AND time =< '07:30:00' GROUP BY floor(time,
2h);}}, I think that ideally the result should return 3 groups: {{01:30:00}}, {{03:30:00}}
and {{05:30:00}}.  
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message