cassandra-user mailing list archives

From Jack Krupansky <>
Subject Re: Strategy for dividing wide rows beyond just adding to the partition key
Date Fri, 11 Mar 2016 21:45:52 GMT
I'll stay away from advising on a specific schema per se, but I'll stick to
the advice that your queries should depend solely on the columns of the
primary key, or on relatively short slices/scans, rather than run the risk
of very long scans or having to process multiple partitions for a single
query. That's canned advice to some extent, but it still applies here.

Of course we generally wish to avoid hotspots, but with time series they
are unavoidable. Sure, you could place successive events in separate
partitions, but then you can't do any kind of scanning/slicing.

But, events for separate sensors are not true hotspots in the traditional
sense - unless you have only a single sensor/unit.

After considering your queries, the next step is to consider the
cardinality of your data - how many sensors, how many units, rate of
events, etc. That will feed back into the queries as well, such as how big
a slice or scan might be, as well as the sizing of partitions.

So, how many sensor units do you expect, how many sensors per unit, and
expected rate of events per sensor?
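To illustrate why those numbers matter, here is the back-of-envelope
arithmetic for sizing a weekly partition. All input figures below are
hypothetical, chosen only to show the method - they are not numbers from
this thread:

```python
# Back-of-envelope sizing of one weekly time-shard partition.
# The event rate and blob size are assumptions for illustration only.
events_per_sensor_per_sec = 1.0   # assumed sustained event rate per sensor
bytes_per_reading = 200           # assumed average size of the readings blob
seconds_per_week = 7 * 24 * 3600

rows_per_partition = events_per_sensor_per_sec * seconds_per_week
partition_bytes = rows_per_partition * bytes_per_reading

print(f"rows per weekly partition: {rows_per_partition:,.0f}")
print(f"approx partition size: {partition_bytes / 1e6:.0f} MB")
```

At these assumed rates a weekly shard holds roughly 600K rows and ~120 MB,
which is the kind of figure you would weigh against your cluster's
comfortable partition-size limits before fixing the shard width.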

Try not to jump too quickly to specific solutions - there really is a
method to understanding all of this other stuff upfront.
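For concreteness, the year-plus-week shard key being discussed in this
thread can be derived like this. This is a minimal Python sketch; the
particular encoding (year * 100 + ISO week) is my assumption, not something
specified in the thread:

```python
from datetime import datetime, timezone

def time_shard(ts: datetime) -> int:
    """Encode a timestamp as a year/week bucket, e.g. 2016-03-11 -> 201610.

    Uses the ISO calendar, so days near Jan 1 may fall into the previous
    ISO year's final week (e.g. 2016-01-01 -> 201553).
    """
    iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    return iso[0] * 100 + iso[1]

print(time_shard(datetime(2016, 3, 11, tzinfo=timezone.utc)))  # 201610
```

Whatever encoding is chosen, it must sort and compare consistently, since
it becomes part of the partition key that every query has to reconstruct.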

-- Jack Krupansky

On Thu, Mar 10, 2016 at 12:39 PM, Jason Kania <> wrote:

> Jack,
> Thanks for the response. I don't think I provided enough information, and
> I may have used the wrong terminology, as your response reads more like the
> canned advice given in response to Cassandra antipatterns.
> To make this clearer, this is what we are doing:
> create table sensorReadings (
>     sensorUnitId int,
>     sensorId int,
>     time timestamp,
>     timeShard int,
>     readings blob,
>     primary key ((sensorUnitId, sensorId, timeShard), time)
> );
> where timeShard is a combination of year and week of year
> For known time range based queries, this works great. However, the
> specific problem is in knowing the maximum and minimum timeShard values
> when we want to select the entire range of data. Our understanding is that
> if we update another related table with the maximum and minimum timeShard
> value for a given sensorUnitId and sensorId combination, we will create a
> hotspot and lots of tombstones. If we SELECT DISTINCT, we get a huge list
> of partition keys for the table because we cannot reduce the scope with a
> where clause.
> If there is a recommended pattern that solves this, we haven't come across
> it.
> I hope this makes the problem clearer.
> Thanks,
> Jason
> ------------------------------
> *From:* Jack Krupansky <>
> *To:*; Jason Kania <>
> *Sent:* Thursday, March 10, 2016 10:42 AM
> *Subject:* Re: Strategy for dividing wide rows beyond just adding to the
> partition key
> There is an effort underway to support wider rows:
> This won't help you now though. Even with that improvement you still may
> need a more optimal data model since large-scale scanning/filtering is
> always a very bad idea with Cassandra.
> The data modeling methodology for Cassandra dictates that queries drive
> the data model and that each form of query requires a separate table
> ("query table.") Materialized view can automate that process for a lot of
> cases, but in any case it does sound as if some of your queries do require
> additional tables.
> As a general proposition, Cassandra should not be used for heavy filtering
> - query tables with the filtering criteria baked into the PK are the way to
> go.
> -- Jack Krupansky
> On Thu, Mar 10, 2016 at 8:54 AM, Jason Kania <>
> wrote:
> Hi,
> We have sensor input that creates very wide rows, and operations on these
> rows have started to time out regularly. We have been trying to find a
> solution to dividing wide rows but keep hitting limitations that move the
> problem around instead of solving it.
> We have a partition key consisting of a sensorUnitId and a sensorId and
> use a time field to access each column in the row. We tried adding a
> time-based entry, timeShardId, to the partition key, consisting of the year
> and week of year during which the reading was taken. This works for a
> number of queries but for scanning all the readings against a particular
> sensorUnitId and sensorId combination, we seem to be stuck.
> We won't know the range of valid values of the timeShardId for a given
> sensorUnitId and sensorId combination, so we would have to write to an
> additional table to track the valid timeShardId values. We suspect this would
> create tombstone accumulation problems given the number of updates required
> to the same row so haven't tried this option.
> Alternatively, we hit a different bottleneck in the form of SELECT
> DISTINCT in trying to directly access the partition keys. Since SELECT
> DISTINCT does not allow for a where clause to filter on the partition key
> values, we have to filter several hundred thousand partition keys just to
> find those related to the relevant sensorUnitId and sensorId. This problem
> will only grow worse for us.
> Are there any other approaches that can be suggested? We have been looking
> around, but haven't found any references beyond the initial suggestion to
> add some sort of shard id to the partition key to handle wide rows.
> Thanks,
> Jason
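One possible workaround for the min/max timeShard problem described above:
if the earliest possible reading date is known (for instance, when a sensor
unit was deployed), the client can enumerate every candidate timeShard
between then and now and issue one per-partition query for each, rather
than maintaining a separately updated min/max table. This is a sketch under
stated assumptions only - the shard encoding (year * 100 + ISO week) and
the known-deployment-date premise are mine, not from the thread:

```python
from datetime import date, timedelta

def shard_range(start: date, end: date) -> list[int]:
    """List every year*100 + ISO-week bucket from start to end, inclusive."""
    shards: list[int] = []
    day = start
    while day <= end:
        iso = day.isocalendar()
        shard = iso[0] * 100 + iso[1]
        if not shards or shards[-1] != shard:
            shards.append(shard)  # record each weekly bucket once, in order
        day += timedelta(days=1)
    return shards

# The client would then run, per shard:
#   SELECT ... WHERE sensorUnitId = ? AND sensorId = ? AND timeShard = ?
print(shard_range(date(2015, 12, 20), date(2016, 1, 15)))
```

The trade-off is extra queries against possibly empty partitions, but reads
of nonexistent partitions are cheap, and it avoids the tombstone-heavy
tracking table entirely.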
