cassandra-user mailing list archives

From Junaid Nasir <jna...@an10.io>
Subject C* data modeling for time series
Date Wed, 26 Jul 2017 12:15:30 GMT
I have a C* cluster (3 nodes) with about 60 GB of data (replication factor 2).
When I started using C*, coming from an SQL background, I didn't give much
thought to modeling the data correctly. This is the table I created:

CREATE TABLE data ( deviceId int,
                    time timestamp,
                    field1 text,
                    field2 text,
                    field3 text,
                    PRIMARY KEY(deviceId, time))
    WITH CLUSTERING ORDER BY (time ASC);

But most of the queries I run (using Spark and the DataStax connector) compare
data from different devices over some time period. For example:

SELECT * FROM data WHERE time > '2017-07-01 12:00:00';

From my understanding this runs a full table scan, as shown in the Spark UI
(the DAG visualization shows "Scan
org.apache.spark.sql.cassandra.CassandraSourceRelation@32bb7d65"), meaning
C* reads all the data and then filters on time. The Spark jobs run for
hours even for small time frames.
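
To illustrate the scan behaviour described above, here is a toy Python model
of the table layout. It is not Cassandra or connector code: the `partitions`
dict, the integer timestamps, and both function names are made up for
illustration. It only assumes what the schema states, namely that rows are
grouped by deviceId (the partition key) and sorted by time within each group.

```python
from bisect import bisect_right

# Hypothetical data: deviceId -> times sorted ascending (clustering order).
# Small integers stand in for real timestamps.
partitions = {
    1: [10, 20, 30],
    2: [15, 25],
    3: [5, 35],
}

def scan_time_only(after):
    """Model of `WHERE time > after` with no deviceId restriction.

    Nothing in the predicate says which partitions hold matching rows,
    so every partition has to be visited.
    """
    visited = 0
    hits = []
    for device_id, times in partitions.items():
        visited += 1
        start = bisect_right(times, after)  # cheap inside a sorted partition
        hits.extend((device_id, t) for t in times[start:])
    return hits, visited

def scan_one_device(device_id, after):
    """Model of `WHERE deviceId = ? AND time > after`: one partition only."""
    times = partitions.get(device_id, [])
    start = bisect_right(times, after)
    return [(device_id, t) for t in times[start:]], 1

all_hits, all_visited = scan_time_only(20)
one_hits, one_visited = scan_one_device(2, 20)
print(all_visited, one_visited)
```

In this model the time-only query touches all three partitions while the
query that also names a device touches one, which is the same pattern as the
full scan seen in the Spark UI.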

What is the right approach to data modeling for such queries? I want to
get a general idea of what to look for when modeling such data.
I really appreciate all the help from this community :). If you need any
extra details, please ask me here.

Regards,
Junaid
