cassandra-user mailing list archives

From Jon Haddad <...@jonhaddad.com>
Subject Re: Time series modeling in C* for range queries
Date Mon, 20 Nov 2017 05:18:34 GMT
Hi Junaid,

I wrote a blog post a few months ago on massively scalable time series, going into a couple
techniques on bucketing that you might find helpful.

http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
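As a rough illustration of the bucketing idea (a sketch of my own, not taken from the post; it assumes daily buckets and a composite partition key of (deviceid, day)), the client enumerates the day buckets covered by the query range and issues one partition read per bucket:

```python
from datetime import date, timedelta

def day_buckets(start, end):
    """Enumerate the daily bucket values covering [start, end] inclusive.

    Each value addresses one partition, e.g. for a query shaped like:
    SELECT * FROM fuel_signal WHERE deviceid = ? AND day IN (...)
    """
    days = []
    d = start
    while d <= end:
        days.append(d)
        d += timedelta(days=1)
    return days

# A 3-month window fans out to ~90 partitions per device, which is
# why very long ranges usually call for a coarser bucket (e.g. monthly).
print(len(day_buckets(date(2017, 9, 1), date(2017, 11, 30))))
```

For long ranges it is generally better to issue the per-bucket reads as separate (ideally async) queries rather than one giant IN list, since a large IN forces the coordinator to touch many partitions in a single request.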

Jon

> On Nov 19, 2017, at 2:09 PM, Junaid Nasir <jnasir@an10.io> wrote:
> 
> We are building an IoT platform where time series data from millions of devices is to
> be collected and then used for analytics pertaining to Business Intelligence/Analytics
> (BI/BA).
> 
> Within the above context, we are running into the issue of handling range-based queries,
> where the granularity of the range can change. Furthermore, the query aggregation unit can
> be on the basis of device ID, possibly over a collection of such devices.
> 
> This leads us to model the table in the following way. 
> 
> C* table schemas
>                   
> // Per sensor table
> CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((deviceid), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
> // Per device
> CREATE TABLE ks.device1 (
>     sensor text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((sensor), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
> 
> 
> While the above does solve the need to query a single sensor value over a range of dev_ids,
> it does pose a problem when doing the same query over a time range. The problem, simply put,
> is that the time-cost of doing a range query (for the same range) in the above model increases
> with the total size of the table (we are using Spark DataFrames for this purpose).
> 
> The first question is whether there is a best practice to handle such a flow,
> which seems to us quite natural and should be common.
> 
> Then, based on reading the mailing list and previous discussions here, it seems to us
> that we can bucket the data by date, making the date bucket part of the partition
> key.
> 
> 
> date time buckets schema
>                   
> CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     day date,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((deviceid, day), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC);
> 
> 
> 
> but this would mean a very large number of values in the IN [Array] clause if the date
> bucket is daily and the query covers the last 3 months.
> 
> So, is this the right way or is there a better way?
> 
> Thanks,
> Junaid
> 

