Subject: Re: Strategy for dividing wide rows beyond just adding to the partition key
Reply-To: user@cassandra.apache.org
Date: Thu, 10 Mar 2016 10:42:38 -0500
From: Jack Krupansky
To:
user@cassandra.apache.org, Jason Kania

There is an effort underway to support wider rows:
https://issues.apache.org/jira/browse/CASSANDRA-9754

This won't help you now, though. Even with that improvement you may still need a more optimal data model, since large-scale scanning/filtering is always a very bad idea with Cassandra.

The data modeling methodology for Cassandra dictates that queries drive the data model and that each form of query requires a separate table ("query table"). Materialized views can automate that process for a lot of cases, but in any case it does sound as if some of your queries do require additional tables.

As a general proposition, Cassandra should not be used for heavy filtering - query tables with the filtering criteria baked into the PK are the way to go.

-- Jack Krupansky

On Thu, Mar 10, 2016 at 8:54 AM, Jason Kania wrote:

> Hi,
>
> We have sensor input that creates very wide rows, and operations on these
> rows have started to time out regularly. We have been trying to find a
> solution to dividing wide rows but keep hitting limitations that move the
> problem around instead of solving it.
>
> We have a partition key consisting of a sensorUnitId and a sensorId, and
> we use a time field to access each column in the row. We tried adding a
> time-based entry, timeShardId, to the partition key, consisting of the
> year and week of year during which the reading was taken. This works for a
> number of queries, but for scanning all the readings against a particular
> sensorUnitId and sensorId combination we seem to be stuck.
>
> We won't know the range of valid values of the timeShardId for a given
> sensorUnitId and sensorId combination, so we would have to write to an
> additional table to track the valid timeShardIds.
> We suspect this would create tombstone accumulation problems, given the
> number of updates required to the same row, so we haven't tried this
> option.
>
> Alternatively, we hit a different bottleneck in the form of SELECT
> DISTINCT when trying to access the partition keys directly. Since SELECT
> DISTINCT does not allow a WHERE clause to filter on the partition key
> values, we have to filter several hundred thousand partition keys just to
> find those related to the relevant sensorUnitId and sensorId. This problem
> will only grow worse for us.
>
> Are there any other approaches that can be suggested? We have been looking
> around but haven't found any references beyond the initial suggestion to
> add some sort of shard id to the partition key to handle wide rows.
>
> Thanks,
>
> Jason
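For what it's worth: since the shard id described above is just year plus week of year, the set of candidate timeShardIds for a sensor is fully determined by a time range. If the date of the first reading per sensorUnitId/sensorId is known (an assumption on my part), the valid shards can be enumerated client-side and queried one partition at a time, with no tracking table and no SELECT DISTINCT. A rough Python sketch; the names `time_shard_id` and `shard_ids_between` are made up for illustration:

```python
from datetime import date, timedelta

def time_shard_id(d: date) -> str:
    """Build a shard id from the ISO year and ISO week, e.g. '2016-10'."""
    iso = d.isocalendar()
    return f"{iso[0]}-{iso[1]:02d}"

def shard_ids_between(start: date, end: date) -> list:
    """Enumerate every weekly shard id from start to end, inclusive.
    Stepping in 7-day increments advances the ISO week by exactly one,
    so no week can be skipped."""
    seen = []
    d = start
    while d <= end:
        sid = time_shard_id(d)
        if not seen or seen[-1] != sid:
            seen.append(sid)
        d += timedelta(days=7)
    # the last 7-day step may overshoot; make sure end's week is included
    last = time_shard_id(end)
    if seen[-1] != last:
        seen.append(last)
    return seen
```

Each returned shard id then completes a fully-specified partition key (sensorUnitId, sensorId, timeShardId), so every read is a single-partition query rather than a scan. This is only a sketch of the enumeration idea, not a drop-in solution.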