cassandra-user mailing list archives

From Guillaume Charhon <>
Subject Re: How to organize a timeseries by device?
Date Mon, 09 Nov 2015 15:53:15 GMT
For the first table: (device_id, timestamp), should I add a bucket even if
I know I might have millions of events per device but never billions?

On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <> wrote:

> Cassandra is good at two kinds of queries: 1) access a specific row by a
> specific key, and 2) access a slice or consecutive sequence of rows within
> a given partition.
> It is recommended to avoid ALLOW FILTERING. If it happens to work well for
> you, great, go for it, but if it doesn't then simply don't do it. Best to
> redesign your data model to play to Cassandra's strengths.
> If you bucket the time-based table, do a separate query for each time
> bucket.
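
As a sketch of that per-bucket query pattern (the table, column names, and
day-granularity bucket below are illustrative assumptions, not from the thread):

```sql
-- Hypothetical schema: day-bucketed events per device.
CREATE TABLE events_by_device (
    device_id text,
    bucket    text,       -- e.g. '2015-11-09' (one partition per device per day)
    ts        timestamp,
    latitude  double,
    longitude double,
    PRIMARY KEY ((device_id, bucket), ts)
);

-- "Last two days for device 42": one query per bucket,
-- results merged client-side.
SELECT ts, latitude, longitude FROM events_by_device
 WHERE device_id = '42' AND bucket = '2015-11-08';
SELECT ts, latitude, longitude FROM events_by_device
 WHERE device_id = '42' AND bucket = '2015-11-09';
```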
> -- Jack Krupansky
> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
>> wrote:
>> Kai, Jack,
>> On 1., should the bucket be a STRING with a date format or do I have a
>> better option? For (device_id, bucket, timestamp), did you mean
>> ((device_id, bucket), timestamp)?
>> On 2., what are the risks of timeout? I currently have this warning:
>> "Cannot execute this query as it might involve data filtering and thus may
>> have unpredictable performance. If you want to execute this query despite
>> the performance unpredictability, use ALLOW FILTERING".
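
For reference, a query shaped like the one below produces that warning
(assuming a table with device_id as the partition key and ts as the
clustering column, as described earlier in the thread): with no partition
key restriction, the range predicate would have to scan every partition.

```sql
-- Assumed schema: PRIMARY KEY (device_id, ts).
-- A range on the clustering column with no partition key restriction
-- means a cluster-wide scan, so Cassandra refuses it unless you append
-- ALLOW FILTERING:
SELECT * FROM events WHERE ts > '2015-11-02' ALLOW FILTERING;
```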
>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <> wrote:
>>> 1. Don't make your partitions unbounded. It's tempting to just use
>>> (device_id, timestamp), but sooner or later you will have problems as time
>>> goes by. You can keep the partition bounded by using (device_id, bucket,
>>> timestamp). Use hour, day, month or even year like Jack mentioned, depending
>>> on the size of the data.
>>> 2. As to your specific query, for a given partition and a time range, C*
>>> doesn't need to load the whole partition then filter. It only retrieves the
>>> slice within the time range from disk because the data is clustered by
>>> timestamp.
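
Both points, sketched in CQL (table and column names, and the month-granularity
bucket, are illustrative assumptions):

```sql
-- 1. Bounded partitions: the (device_id, bucket) composite partition key
--    caps how large any one partition can grow.
CREATE TABLE events (
    device_id text,
    bucket    text,      -- e.g. month: '2015-11'
    ts        timestamp,
    latitude  double,
    longitude double,
    PRIMARY KEY ((device_id, bucket), ts)
);

-- 2. A time-range read within one partition is a contiguous slice on disk,
--    because rows are clustered (sorted) by ts; no filtering is involved.
SELECT ts, latitude, longitude FROM events
 WHERE device_id = '42' AND bucket = '2015-11'
   AND ts >= '2015-11-02' AND ts < '2015-11-09';
```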
>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <
>>> > wrote:
>>>> The general rule in Cassandra data modeling is to look at all of your
>>>> queries first and then to declare a table for each query, even if that
>>>> means storing multiple copies of the data. So, create a second table with
>>>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>>>> interval makes sense to give 1 to 10 megabytes per partition) and time and
>>>> device as the clustering keys.
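
A sketch of that second table (names and the one-day bucket are assumptions;
pick whatever interval yields roughly 1 to 10 MB per partition):

```sql
-- Hypothetical query table: partitioned by time bucket, so "all devices
-- in a time range" becomes a slice per bucket rather than a full scan.
CREATE TABLE events_by_time (
    bucket    text,      -- e.g. '2015-11-09' for a one-day bucket
    ts        timestamp,
    device_id text,
    latitude  double,
    longitude double,
    PRIMARY KEY ((bucket), ts, device_id)
);

-- "All devices, last week" = seven queries, one per day bucket:
SELECT * FROM events_by_time WHERE bucket = '2015-11-03';
-- ...repeated for each remaining day through '2015-11-09'.
```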
>>>> Or, consider DSE Search, and then you can do whatever ad hoc queries
>>>> you want using Solr. Or Stratio or TupleJump Stargate for an open source
>>>> Lucene plugin.
>>>> -- Jack Krupansky
>>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>>>> wrote:
>>>>> Hello,
>>>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>>>> for each device we track. We currently have 2 TB of data. I would like to
>>>>> store the device_id, the timestamp of the event, latitude and longitude. I
>>>>> thought about using the device_id as the partition key and timestamp as the
>>>>> clustering column. It is great as events are naturally grouped by device
>>>>> (very useful for our Spark jobs). However, if I want to retrieve the
>>>>> events of all devices for the last week, I understood that Cassandra will
>>>>> need to load all the data and filter it, which does not seem clean in the
>>>>> long term.
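
The model described above would look roughly like this (column names are
assumptions); its partition grows without bound, which is the concern raised
later in the thread:

```sql
-- The described model: one partition per device, growing indefinitely.
CREATE TABLE events (
    device_id text,
    ts        timestamp,
    latitude  double,
    longitude double,
    PRIMARY KEY (device_id, ts)   -- partition = device_id, clustering = ts
);
```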
>>>>> How should I create my model?
>>>>> Best Regards
