cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benjamin Lerer (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11871) Allow to aggregate by time intervals
Date Wed, 25 May 2016 13:14:12 GMT


Benjamin Lerer commented on CASSANDRA-11871:

>From the collection standpoint, if we have, for example values {{1, 2, 3, 4, 5}} written
in the column {{a}}, then something like {{GROUP BY parity(a)}} would not be allowed, right
(e.g. to return {{odd: \[1,3,5\], even: \[2,4\]}} or similar, depending on the output formatting)?

No it will not be allowed. The function is not {{monotonic}}.

Just wondering, is this imposed by some sort of limitation, or depends on the data? Since
if, for example, aggregate function expects some group monotonicity, this would help to restrict
the values, but if it's commutative, it shouldn't matter, right?

We only allow grouping on {{partition key + clustering prefix}} because, as rows are sorted
by the values of their clustering columns, it allow us to build the aggregates on the flight.
We know exactly when we have reached our page limit and can return the results to the user.

> Allow to aggregate by time intervals
> ------------------------------------
>                 Key: CASSANDRA-11871
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
> For time series data it can be usefull to aggregate by time intervals.
> The idea would be to add support for one or several functions in the {{GROUP BY}} clause.
> Regarding the implementation, even if in general I also prefer to follow the SQL syntax,
I do not believe it will be a good fit for Cassandra.
> If we have a table like:
> {code}
> {
>     symbol text,
>     date date,
>     time time,
>     priceMantissa int,
>     priceExponent tinyint,
>     volume int,
>     PRIMARY KEY ((symbol, date), time)
> };
> {code}
> The trades will be inserted with an increasing time and sorted in the same order. As
we can have to process a large amount of data, we want to try to limit ourself to the cases
where we can build the groups on the flight (which is not a requirement in the SQL world).
> If we want to get the number of trades per minutes with the SQL syntax we will have to
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' AND date
= '2016-01-11' GROUP BY hour(time), minute(time);}}
> which is fine. The problem is that if the user invert by mistake the functions like that:
> {{SELECT hour(time), minute(time), count() FROM Trades WHERE symbol = 'AAPL' AND date
= '2016-01-11' GROUP BY minute(time), hour(time);}}
> the query will return weird results.
> The only way to prevent that would be to check the function order and make sure that
we do not allow to skip functions (e.g. {{GROUP BY hour(time), second(time)}}).
> In my opinion a function like {{floor(<columnName>, <time range>)}} will
be much better as it does not allow for this type of mistakes and is much more flexible (you
can create 5 minutes buckets if you want to).
> {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY floor(time, m);}}
> An important aspect to keep in mind with a function like {{floor}} is the starting point.
For a query like:  {{SELECT floor(time, m), count() FROM Trades WHERE symbol = 'AAPL' AND
date = '2016-01-11' AND time >= '01:30:00' AND time =< '07:30:00' GROUP BY floor(time,
2h);}}, I think that ideally the result should return 3 groups: {{01:30:00}}, {{03:30:00}}
and {{05:30:00}}.  

This message was sent by Atlassian JIRA

View raw message