cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benjamin Lerer (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-9778) CQL support for time series aggregation
Date Tue, 12 Jan 2016 09:08:40 GMT


Benjamin Lerer commented on CASSANDRA-9778:

While working on CASSANDRA-10707, I started to think about that windowed aggregates problem.

I really think that it is a usefull functionality that we should have. I have used it for
years, with MySQL, for analysing data.

Regarding the implementation, even if in general I also prefer to follow the SQL syntax, I
do not believe it will be a good fit for Cassandra.

If we have a table like:
    symbol text,
    date date,
    time time,
    priceMantissa int,
    priceExponent tinyint,
    volume int,
    PRIMARY KEY ((symbol, date), time)
The trades will be inserted with an increasing time and sorted in the same order, which is
really the use case targeted by this ticket and by CASSANDRA-10707. As we can have to process
a large amount of data, we want to try to limit ourself to the cases where we can build the
groups on the flight (which is not a requirement in the SQL world).

If we want to get the number of trades per minutes with the SQL syntax we will have to write:
{{SELECT hour(time), minute(time), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date =
'2016-01-11' GROUP BY hour(time), minute(time);}}
which is fine. The problem is that if the user invert by mistake the functions like that:
{{SELECT hour(time), minute(time), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date =
'2016-01-11' GROUP BY minute(time), hour(time);}}
the query will return weird results if it is a normal SELECT and will be pretty inefficient
within a MV.
The only way to prevent that would be to check the function order and make sure that we do
not allow to skip functions (e.g. {{ GROUP BY  hour(time), second(time)}}).

In my opinion a function like {{floor(<columnName>, <time range>)}} will be much
better as it does not allow for this type of mistakes and is much more flexible (you can create
5 minutes buckets if you want to).
{{SELECT floor(time, m), count((*)) FROM Trades WHERE symbol = 'AAPL' AND date = '2016-01-11'
GROUP BY floor(time, m);}}

> CQL support for time series aggregation
> ---------------------------------------
>                 Key: CASSANDRA-9778
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: CQL
>            Reporter: Jonathan Ellis
>            Assignee: Benjamin Lerer
>             Fix For: 3.x
> Along with MV (CASSANDRA-6477), time series aggregation or "rollups" are a common design
pattern in cassandra applications.  I'd like to add CQL support for this along these lines:
> {code}
> SELECT exchange, day, day_time(1h) AS hour, symbol, avg(price), sum(volume)
> FROM stocks
> GROUP BY exchange, day, symbol, hour
> PRIMARY KEY  ((exchange, day), hour, symbol);
> {code}

This message was sent by Atlassian JIRA

View raw message