cassandra-user mailing list archives

From onmstester onmstester <>
Subject Re: Interesting Results - Cassandra Benchmarks over Time Series Data for IoT Use Case I
Date Sat, 19 May 2018 03:25:01 GMT
I recommend reviewing the Newts data model, a time-series data model built on top of Cassandra:


First, the use case: we have time series of data from devices at several sites, where each
device (with a unique dev_id) can have several sensors attached to it. Most queries, however,
are both limited in time and span a range of dev_ids, even for a single sensor (multi-sensor
joins are a whole different beast for another day!). We want a schema where a query completes
in time linear in the queried device range and time range, (largely) independent of the total
data size.

So we explored several different primary key definitions, learning from the best practices
communicated on this mailing list and over the interwebs. The details of the setup (Spark
over C*) and the schema are in a companion blog/site [1]; here we mention only the primary keys
and the key points.

PRIMARY KEY (dev_id, day, rec_time)

PRIMARY KEY ((dev_id, rec_time)

PRIMARY KEY (day, dev_id, rec_time)

PRIMARY KEY ((day, dev_id), rec_time)

PRIMARY KEY ((dev_id, day), rec_time)

Combinations of the above, adding a year field to the schema.
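
As a rough illustration of what "linear in the query ranges" would mean for the
((dev_id, day), rec_time) layout listed above, the sketch below counts the partitions a
query would have to read. It is only a counting exercise, not a reproduction of the benchmark;
the function names and device ids are hypothetical.

```python
from datetime import date, timedelta
from itertools import product


def days_in_range(start, end):
    """Return every calendar day from start to end, inclusive."""
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]


def partitions_for_query(dev_ids, start, end):
    """List the (dev_id, day) partitions a query would have to read.

    Assumes PRIMARY KEY ((dev_id, day), rec_time), where each (dev_id, day)
    pair is one partition.
    """
    return list(product(dev_ids, days_in_range(start, end)))


# Hypothetical query: three devices over one week.
parts = partitions_for_query(["dev-1", "dev-2", "dev-3"], date(2018, 5, 1), date(2018, 5, 7))
print(len(parts))  # 3 devices x 7 days = 21 partitions
```

The count depends only on the number of devices and days requested, not on how much data
the table holds overall.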

The main takeaway (again, please read through the details at [1]) is that we really don't
have a single schema that answers the use case above without some drawback. While ((day,
dev_id), rec_time) gives a constant response time, that time depends entirely on the total
data size (full scan). On the other hand, while (dev_id, day, rec_time) and its counterpart
(day, dev_id, rec_time) provide acceptable results, the first suffers from very large
partitions and the second from hotspots while writing.

We also observed that a multi-field partition key allows fast querying only if "=" is
used going left to right. If an IN() (e.g. for specifying a range of time or a list of
devices) is used once in that order, then any further use of IN() removes any benefit (i.e.
it results in a near full table scan).
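
One commonly used way to avoid IN() on partition key columns is to issue one query per
partition, with "=" on every partition key field and a range on the clustering column. The
sketch below only builds the statement text and bind values; it is not what was benchmarked
in [1], and the table name sensor_readings and the helper name are hypothetical.

```python
from datetime import date, datetime

# Hypothetical table name; the original post does not name its tables.
TABLE = "sensor_readings"

# "=" on both partition key fields, range only on the clustering column.
SELECT_CQL = (
    f"SELECT * FROM {TABLE} "
    "WHERE dev_id = ? AND day = ? AND rec_time >= ? AND rec_time < ?"
)


def per_partition_queries(partitions, t_from, t_to):
    """Return one (statement, bind values) pair per (dev_id, day) partition."""
    return [(SELECT_CQL, (dev_id, day, t_from, t_to)) for dev_id, day in partitions]


partitions = [("dev-1", date(2018, 5, 1)), ("dev-2", date(2018, 5, 1))]
queries = per_partition_queries(
    partitions, datetime(2018, 5, 1, 6), datetime(2018, 5, 1, 18)
)
for cql, params in queries:
    print(cql, params)
```

Each pair could then be bound to a prepared statement and run concurrently by whatever
client is in use; that part is left out here because it depends on the driver.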

Another useful lesson was that using IN() to query for days is less useful than using
a range query.

Currently, it seems we are in a bind: should we use a different data store for our use case
(which seems quite typical for IoT)? Something like HDFS or Parquet? We would love to get
feedback on the benchmarking results and on how we can improve them and share them widely.

[1] Cassandra Benchmarks over Time Series Data for IoT Use Case



Arbab Khalil

Software Design Engineer
