cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From " Brian Hess (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time intervals
Date Tue, 31 May 2016 13:31:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15307720#comment-15307720
] 

 Brian Hess commented on CASSANDRA-11871:
-----------------------------------------

It may not be totally important here, but the following are not the same thing:
1. SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY minute(time);
2. SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY floor(time, m);

In the first we would get exactly 60 groups (one for every minute of the hour, assuming data
for every minute of the hour).  That is data for 12:01, 1:01, 2:01, etc would all be in this
bucket.
In the second we would get 1440 groups (one for every minute of the day, assuming data for
every minute of the day).  That is data for 12:01, 1:01, and 2:01 would each be in their own
groups.

Sometimes people will want to group by both the Hour and the Minute (and other combinations,
but I'm just following the example here) and get 1440 groups, and other times they really
will want to group by the Minute and get 60 groups.  Both use cases are necessary.

It is worth noting that the Minute (and Hour and Day, etc) functions are *not* monotonic.
 When it wraps around the value drops from 59 to 0 (or 11 to 0, or 28/30/31 to 1, etc).  The
Floor function, however, is monotonic because what you get is the full date/time truncated
to the desired resolution (no wrapping).

With respect to the starting ranges for things like "WHERE time > now - 10h GROUP BY Floor(time,
1h)", I think Benjamin is right with saying that if now is 12:21, that the starting point
is then 2:21 and the bucket starts 2:00.  The strange thing is that this first time bucket
(2:00) will only be partially filled since it will only have the data from 2:21-2:59, where
all the other buckets (except the 12:00 one, which only has data for 12:00-12:21) will be
full hours.  Of course, the proper way to specify that is probably "WHERE time > Floor(now
- 10h, 1h) GROUP BY Floor(time, 1h)" - will that be supported [~blerer]?

> Allow to aggregate by time intervals
> ------------------------------------
>
>                 Key: CASSANDRA-11871
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11871
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
>
>
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the SQL syntax,
I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> CREATE TABLE trades
> {
>     symbol text,
>     date date,
>     time time,
>     priceMantissa int,
>     priceExponent tinyint,
>     volume int,
>     PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same order. As
we can have to process a large amount of data, we want to try to limit ourself to the cases
where we can build the groups on the flight (which is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we will have to
write:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' AND date
= '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' AND date
= '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make sure that
we do not allow to skip functions (e.g. {{GROUP BY hour(time), second(time)}}).
> In my opinion a function like {{floor(<columnName>, <time range>)}} will
be much better as it does not allow for this type of mistakes and is much more flexible (you
can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the starting point.
For a query like:  {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND
date = '2016-01-11' AND time >= '01:30:00' AND time =< '07:30:00' GROUP BY floor(time,
2h);}}, I think that ideally the result should return 3 groups: {{01:30:00}}, {{03:30:00}}
and {{05:30:00}}.  
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message