cassandra-user mailing list archives

From: G P <>
Subject: Estimating partition size for C*2.X and C*3.X and Time Series Data Modelling.
Date: Mon, 20 Jun 2016 16:12:07 GMT

I'm currently enrolled in a master's degree, and my thesis project involves the use of Big
Data tools in the context of Smart Grid applications. I explored several storage solutions
and found Cassandra to be a good fit for my problem.
The data is mostly time series data, incoming from multiple PLCs and currently captured
and stored by proprietary SCADA software connected to an MSSQL server. Reading about the C*
storage engine and how time series should be modelled, it is clear that I will have to use
some form of time bucketing to split the data into multiple partitions (see the sketch below).
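
A minimal sketch of what I mean, where the table name, column names and the weekly
granularity are only illustrative:

    CREATE TABLE plc_data_by_week (
        building_analyzer text,
        week_bucket text,    -- e.g. '2016-W25'; one partition per analyzer per week
        time timestamp,
        p1 double,           -- ...plus the remaining measurement columns
        PRIMARY KEY ((building_analyzer, week_bucket), time)
    );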

Here is the issue: in the MSSQL server, each PLC has very wide tables (5 at the moment, for
one building), with around 36 columns of data collected every 10 seconds. Queries read as
many as 15 columns at a time, with time ranges varying between one hour and a whole month.
A simple one-to-one mapping of the MSSQL tables to C* is not recommended due to the way
C*2.X stores its data.

I took the DS220: Data Modelling course, which presents two formulas for estimating a
partition size based on the table design.
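
Writing them out from memory, so treat them as approximate:

    Nv = Nr × (Nc - Npk - Ns) + Ns

    Ps = sum(sizeOf(ck_i)) + sum(sizeOf(cs_j))
         + Nr × (sum(sizeOf(cr_k)) + sum(sizeOf(cc_l)))
         + 8 × Nv

where Nr is the number of rows in the partition, Nc the total number of columns, Npk the
number of primary key columns, Ns the number of static columns, and ck, cs, cr and cc the
partition key, static, regular and clustering columns respectively.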

Note: This Ps formula does not account for column name length, TTLs, counter columns, and
additional overhead.

If my calculations are correct, with a table such as the one below and a time resolution of
10 seconds, the Ps (partition size) would be shy of 10 MB (the value often recommended as
an upper bound) if I partitioned the data weekly.

    CREATE TABLE building_data (    -- table name added only for completeness
        BuildingAnalyzer text,
        Time timestamp,
        P1 double,
        P2 double,
        P3 double,
        Acte1 int,
        Acte2 int,
        Acte3 int,
        PRIMARY KEY (BuildingAnalyzer, Time)
    );
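
My back-of-envelope numbers for a weekly partition, assuming roughly 20 bytes for the
BuildingAnalyzer value (8 bytes for the Time clustering value, 3 doubles and 3 ints per row):

    Nr = 7 × 24 × 3600 / 10               = 60,480 rows
    Nv = 60,480 × (8 - 2 - 0) + 0         = 362,880 cells
    Ps ~ 20 + 60,480 × (8 + 3×8 + 3×4) + 8 × 362,880
       ~ 20 + 2,661,120 + 2,903,040
       ~ 5.6 MB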

However, as of C*3.0, a major refactor of the storage engine reduced storage costs. From
what I could gather in [1], clustering values and column names are no longer repeated for
each cell in a row and, among other things, the timestamps used for conflict resolution
(the 8 × Nv of the 2nd formula) are encoded as varints and can be stored only once per row
when all the cells share the same value.
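
If that is right, then for my write pattern (whole rows inserted at once, so a single
timestamp per row), the 8 × Nv term should shrink to roughly Nr × 1..8 bytes, e.g.:

    Ps(3.0) ~ 20 + 2,661,120 + 60,480 × ~4  ~  2.9 MB

though that still ignores the new per-row overheads described in [2].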

I also read [2], which explains the storage format in intricate detail, but modelling all
of it would add too much complexity to a simple estimation formula.

Is there any way to estimate the partition size of a C*3.X table with formulas similar to
the ones above? Or should I just model my tables the way metric collection systems do, with
generic "parametername" and "value" columns, as sketched below?

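For illustration, a minimal sketch of that alternative model (names are placeholders):

    CREATE TABLE metrics_by_week (
        building_analyzer text,
        week_bucket text,
        parametername text,    -- e.g. 'P1', 'Acte2'
        time timestamp,
        value double,
        PRIMARY KEY ((building_analyzer, week_bucket, parametername), time)
    );

One row per reading per parameter; the int columns would have to be widened to double or
kept in a parallel table with an int value.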


Sorry for the long wall of text,
Best regards,
Gil Pinheiro.
