cassandra-user mailing list archives

From "K. Lawson" <klawso...@gmail.com>
Subject Re: What is the best way to model my time series?
Date Fri, 25 Mar 2016 19:27:39 GMT
Sorry Gerard, I'm afraid I'm not familiar with that project.

The time series I've described is a relatively minor component of an
application which is already powered by Cassandra, so you can see why I'd
prefer a viable way (which I'm quickly learning may not exist) to model it
in Cassandra.

On Fri, Mar 25, 2016 at 2:04 PM, Gerard Maas <gerard.maas@gmail.com> wrote:

> Hi,
>
> It sounds to me like Apache Kafka would be a better fit for your
> requirements. Have you considered that option?
>
> kr, Gerard
> Datastax MVP for Apache Cassandra (so, I'm not suggesting other tech for
> any other reason than seeing it as a better fit)
>
> On Fri, Mar 25, 2016 at 1:31 PM, K. Lawson <klawson88@gmail.com> wrote:
>
>> While adhering to best practices, I am trying to model a time series in
>> Cassandra that is compliant with the following access pattern directives:
>>
>> - Is to be both read and shrunk by a single party, grown by multiple
>> parties
>> - Is to be read as a queue (in other words, its entries, from first to
>> last, are to be paged through in order)
>> - Is to be grown as a queue (in other words, new entries (the number of
>> which is expected to fall in the range of 0 to a couple of hundred per day)
>> are always APPENDED to the series)
>> - Is to be shrunk by way of the removal of any entries which have been
>> processed by the application (immediately upon completion of said
>> processing)
>>
>> So far, I've come up with four solutions, listed below (along with their
>> pros and cons), that are compliant with the directives given above; is
>> there any solution superior to these, and if not, which one of these is
>> optimal?
>>
>>
>>
>> Solution #1:
>>
>>
>> //Processing position markers (saved somewhere on disk)
>> mostRecentProcessedItemInsertTime = 0
>> mostRecentProcessedItemInsertDayStartTime = 0
>>
>> CREATE TABLE IF NOT EXISTS solution_table_1
>> (
>> itemInsertDayStartTime timestamp,
>> itemInsertTime timestamp,
>> itemId timeuuid,
>> PRIMARY KEY (itemInsertDayStartTime, itemInsertTime, itemId)
>> );
>>
>> //Initial row retrieval query (presumably, the position markers will be
>> //appropriately updated after each retrieval)
>>
>> SELECT *
>> FROM solution_table_1
>> WHERE itemInsertDayStartTime IN
>> (mostRecentProcessedItemInsertDayStartTime,
>> mostRecentProcessedItemInsertDayStartTime + 86400000, ...)
>> AND itemInsertTime > mostRecentProcessedItemInsertTime
>> LIMIT 30
>>
>> Pros:
>> - Shards table data across the cluster
>>
>> Cons:
>> - Requires the maintenance of position markers
>> - Requires the explicit specification of partitions (which may or may not
>> have data) to target for retrievals which page the table data by
>> itemInsertTime
>> - Requires correspondence with multiple nodes to satisfy retrievals which
>> page the table data by itemInsertTime
>>
>>
>> Solution #2:
>>
>>
>> CREATE TABLE IF NOT EXISTS solution_table_2
>> (
>> itemInsertTime timestamp,
>> itemId timeuuid,
>> PRIMARY KEY (itemInsertTime, itemId)
>> );
>> CREATE INDEX IF NOT EXISTS ON solution_table_2 (itemInsertTime);
>>
>> //Initial row retrieval query
>> SELECT * FROM solution_table_2 WHERE itemInsertTime > 0 LIMIT 30 ALLOW
>> FILTERING
>>
>> Pros:
>> - Shards table data across the cluster
>> - Enables retrievals which page table data by itemInsertTime to be
>> conducted without explicitly specifying partitions to target
>>
>> Cons:
>> - Specifies the creation of an index on a high-cardinality column
>> - Requires correspondence with multiple nodes, as well as data filtering,
>> to satisfy retrievals which page the table data by itemInsertTime
>>
>>
>> Solution #3:
>>
>> CREATE TABLE IF NOT EXISTS solution_table_3
>> (
>> itemInsertTime timestamp,
>> itemId timeuuid,
>> itemInsertDayStartTime timestamp,
>> PRIMARY KEY (itemInsertTime, itemId)
>> );
>> CREATE INDEX IF NOT EXISTS ON solution_table_3 (itemInsertDayStartTime);
>>
>> //Initial row retrieval query
>> SELECT * FROM solution_table_3 WHERE itemInsertDayStartTime > 0 LIMIT 30
>> ALLOW FILTERING
>>
>> Pros:
>> - Shards table data across the cluster
>> - Enables retrievals which page table data by itemInsertTime to be
>> conducted without explicitly specifying partitions to target
>> - Specifies the creation of an index on a column whose cardinality is
>> expected to be suitable
>>
>> Cons:
>> - Requires correspondence with multiple nodes, as well as data filtering,
>> to satisfy retrievals which page the table data by itemInsertTime
>>
>>
>> Solution #4:
>>
>> CREATE TABLE IF NOT EXISTS solution_table_4
>> (
>> dummyPartitionInt int,
>> itemInsertTime timestamp,
>> itemId timeuuid,
>> PRIMARY KEY (dummyPartitionInt, itemInsertTime, itemId)
>> );
>>
>> //Initial row retrieval query (assuming all rows are inserted with a
>> //dummyPartitionInt value of 0)
>> SELECT * FROM solution_table_4 WHERE dummyPartitionInt = 0 AND
>> itemInsertTime > 0 LIMIT 30
>>
>>
>> Pros:
>> - Enables retrieval to be satisfied with a single replica set
>> - Enables retrievals which page table data by itemInsertTime to be
>> conducted without explicitly specifying more than one partition to target
>>
>> Cons:
>> - Requires the use of a "dummy" column
>> - Confines the table data (and, as a result, all operations on it) to a
>> single partition
>>
>
>
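
For reference, the partition list that Solution #1's IN clause needs (the last processed day, then each following day in steps of 86400000 ms) could be computed roughly as below. This is only a sketch in Python, assuming day buckets fall on UTC midnight boundaries expressed in epoch milliseconds; the names day_start_ms and candidate_partitions are hypothetical and do not come from the thread.

```python
from typing import List

MS_PER_DAY = 86_400_000  # same step as in the quoted IN clause


def day_start_ms(ts_ms: int) -> int:
    """Round an epoch-millisecond timestamp down to the start of its UTC day."""
    return ts_ms - (ts_ms % MS_PER_DAY)


def candidate_partitions(last_day_start_ms: int, now_ms: int) -> List[int]:
    """Return every day-start key from the last processed day through the
    day containing now_ms (assumed bucketing, see the note above)."""
    first = day_start_ms(last_day_start_ms)
    last = day_start_ms(now_ms)
    return list(range(first, last + 1, MS_PER_DAY))
```

The resulting list would be bound to the IN clause as the set of partitions to read; some of those days may hold no rows, which is the second con listed for Solution #1.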
