cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Haddad <...@jonhaddad.com>
Subject Re: Strategy for dividing wide rows beyond just adding to the partition key
Date Thu, 10 Mar 2016 17:47:56 GMT
Oops sorry, you wrote below that the shard is what I was suggesting.  I
didn't fully understand the problem you had.  I'll think about it a little
bit and come up w/ something.

On Thu, Mar 10, 2016 at 9:47 AM Jonathan Haddad <jon@jonhaddad.com> wrote:

> My advice was to use the date that the reading was recorded as part of the
> Partition key instead of some arbitrary shard id.  Then you don't have to
> look anything up in a different table.
>
>
>
> create table sensorReadings (
> sensorUnitId int,
> sensorId int,
> date_recorded date,
> time timestamp,
> timeShard int,
> readings blob,
> primary key((sensorUnitId, sensorId, date_recorded), time);
>
>
> On Thu, Mar 10, 2016 at 9:29 AM Jason Kania <jason.kania@ymail.com> wrote:
>
>> Hi Jonathan,
>>
>> Thanks for the response. To make this clearer, this is what we are doing:
>>
>> create table sensorReadings (
>> sensorUnitId int,
>> sensorId int,
>> time timestamp,
>> timeShard int,
>> readings blob,
>> primary key((sensorUnitId, sensorId, timeShard), time);
>>
>> where timeShard is a combination of year and week of year
>>
>> This works exactly as you mentioned when we know what time range we are
>> querying.
>>
>> The problem is that for those cases where we want to run through all the
>> readings for all timestamps, we don't know the first and last timeShard
>> value to use to constrain the query or iterate over each shard. Our
>> understanding is that updating another table with the maximum or minimum
>> timeShard values on every write to the above table would mean pounding a
>> single row with updates and running SELECT DISTINCT pulls all partition
>> keys.
>>
>> Hopefully this is clearer.
>>
>> Again, any suggestions would be appreciated.
>>
>> Thanks,
>>
>> Jason
>>
>> ------------------------------
>> *From:* Jonathan Haddad <jon@jonhaddad.com>
>> *To:* user@cassandra.apache.org; Jason Kania <jason.kania@ymail.com>
>> *Sent:* Thursday, March 10, 2016 11:21 AM
>> *Subject:* Re: Strategy for dividing wide rows beyond just adding to the
>> partition key
>>
>> Have you considered making the date (or week, or whatever, some time
>> component) part of your partition key?
>>
>> something like:
>>
>> create table sensordata (
>> sensor_id int,
>> day date,
>> ts datetime,
>> reading int,
>> primary key((sensor_id, day), ts);
>>
>> Then if you know you need data by a particular date range, just issue
>> multiple async queries for each day you need.
>>
>> On Thu, Mar 10, 2016 at 5:57 AM Jason Kania <jason.kania@ymail.com>
>> wrote:
>>
>> Hi,
>>
>> We have sensor input that creates very wide rows and operations on these
>> rows have started to timeout regulary. We have been trying to find a
>> solution to dividing wide rows but keep hitting limitations that move the
>> problem around instead of solving it.
>>
>> We have a partition key consisting of a sensorUnitId and a sensorId and
>> use a time field to access each column in the row. We tried adding a time
>> based entry, timeShardId, to the partition key that consists of the year
>> and week of year during which the reading was taken. This works for a
>> number of queries but for scanning all the readings against a particular
>> sensorUnitId and sensorId combination, we seem to be stuck.
>>
>> We won't know the range of valid values of the timeShardId for a given
>> sensorUnitId and sensorId combination so would have to write to an
>> additional table to track the valid timeShardId. We suspect this would
>> create tombstone accumulation problems given the number of updates required
>> to the same row so haven't tried this option.
>>
>> Alternatively, we hit a different bottleneck in the form of SELECT
>> DISTINCT in trying to directly access the partition keys. Since SELECT
>> DISTINCT does not allow for a where clause to filter on the partition key
>> values, we have to filter several hundred thousand partition keys just to
>> find those related to the relevant sensorUnitId and sensorId. This problem
>> will only grow worse for us.
>>
>> Are there any other approaches that can be suggested? We have been
>> looking around, but haven't found any references beyond the initial
>> suggestion to add some sort of shard id to the partition key to handle wide
>> rows.
>>
>> Thanks,
>>
>> Jason
>>
>>
>>
>>

Mime
View raw message