cassandra-user mailing list archives

From Justin Cameron <jus...@instaclustr.com>
Subject Re: Time series modeling in C* for range queries
Date Sun, 19 Nov 2017 23:51:31 GMT
Hi Junaid,

Using a "bucketing" key ("day") is the recommended way to limit the size of
partitions. In your case you would probably need something like:
PRIMARY KEY ((deviceid, day), datetime).
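For illustration, a bucketed version of your fuel_signal table might look like the following (a sketch only - table and column names are taken from your original schema, and the day column would be derived from the event timestamp by the writer):

```
CREATE TABLE ks.fuel_signal (
    deviceid text,
    day date,
    datetime timestamp,
    value text,
    PRIMARY KEY ((deviceid, day), datetime)
) WITH CLUSTERING ORDER BY (datetime ASC);
```

With this layout each device/day pair is its own partition, so partition size is bounded by one day of readings per device.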

Have you considered computing a running aggregate as the data comes into
Cassandra? Rather than execute a range query to compute the aggregate when
you need to retrieve it, you could query the pre-aggregated value
associated with the device. If you are able to tolerate the risk of small
data inaccuracies potentially being introduced each time an update fails, a
counter column is the most efficient way to achieve this -
http://cassandra.apache.org/doc/latest/cql/types.html#counters.
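A minimal sketch of a pre-aggregated counter table (names here are illustrative, not from your schema; note that counter tables may only contain counter and primary key columns, counters are modified with UPDATE rather than INSERT, and this assumes numeric readings - your value column is text, so the delta would need to be parsed to an integer client-side):

```
CREATE TABLE ks.fuel_signal_agg (
    deviceid text,
    day date,
    total counter,
    reading_count counter,
    PRIMARY KEY ((deviceid, day))
);

-- executed once per incoming reading, with the parsed value as the delta:
UPDATE ks.fuel_signal_agg
SET total = total + 42, reading_count = reading_count + 1
WHERE deviceid = 'device1' AND day = '2017-11-20';
```

Reading the aggregate is then a single-partition (or small multi-partition) lookup instead of a scan over raw readings.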

If you need a guaranteed accurate value then the IN query with up to 90
partitions could be your best option, especially if you don't need to
execute the query regularly.
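For example, aggregating the last 90 days for one device means enumerating the day buckets client-side, along these lines (dates are illustrative):

```
SELECT deviceid, datetime, value
FROM ks.fuel_signal
WHERE deviceid = 'device1'
  AND day IN ('2017-08-22', '2017-08-23', /* ...one entry per day... */ '2017-11-19');
```

Many drivers actually perform better issuing one asynchronous query per partition rather than a single large IN, since the coordinator otherwise has to gather all 90 partitions itself.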

Your schema/use case is also probably a good candidate for
TimeWindowCompactionStrategy - see
http://cassandra.apache.org/doc/latest/operating/compaction.html#time-window-compactionstrategy
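Enabling it is just a table option; for daily buckets, something like the following (the window size is an assumption you would tune to your bucket and retention settings):

```
ALTER TABLE ks.fuel_signal
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};
```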


Justin

On Mon, 20 Nov 2017 at 06:10 Junaid Nasir <jnasir@an10.io> wrote:

> We are building an IoT platform where time series data from millions of
> devices is to be collected and then used to do some analytics pertaining to
> Business Intelligence/Analytics (BI/BA).
>
> Within the above context, we are running into the issue of having
> range-based queries, where the granularity of the range can change.
> Furthermore, the query aggregation unit can be on the basis of device ID,
> and possibly over a collection of such devices.
>
> This leads us to model the table in the following way.
>
> C* table schemas
>
>                   // Per sensor table
> CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((deviceid), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC)
> // Per device
> CREATE TABLE ks.device1 (
>     sensor text,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((sensor), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC)
>
>
>
> While the above does solve the need to query a single sensor value over a
> range of dev_ids, it does pose a problem when doing the same query for a
> time range. The problem, simply put, is that the time-cost of doing a
> range query (for the same range) in the above model increases with the
> total size of the table (we are using Spark DataFrames for this purpose).
>
> The first question is whether there is a best practice to handle such a
> flow, which seems to us quite natural and should be common.
>
> Then, it seems to us, based on reading the mailing list and previous
> discussions here, that we can bucket by date so that the date range
> becomes part of the primary key.
>
>
> date time buckets schema
>
>                   CREATE TABLE ks.fuel_signal (
>     deviceid text,
>     day date,
>     datetime timestamp,
>     value text,
>     PRIMARY KEY ((day), datetime)
> ) WITH CLUSTERING ORDER BY (datetime ASC)
>
>
>
>
> but this would result in a very large IN list if the date bucket is daily
> and we have a query that covers the last 3 months.
>
> So, is this the right way or is there a better way?
>
> Thanks,
> Junaid
>
-- 


*Justin Cameron*
Senior Software Engineer


<https://www.instaclustr.com/>


This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
and Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally
privileged information.  If you are not the intended recipient, do not copy
or disclose its content, but please reply to this email immediately and
highlight the error to the sender and then immediately delete the message.
