From: Jack Krupansky <jack.krupansky@gmail.com>
To: user@cassandra.apache.org, Jason Kania
Date: Fri, 11 Mar 2016 18:22:48 -0500
Subject: Re: Strategy for dividing wide rows beyond just adding to the partition key

Thanks for the additional information, but there is still not enough color
on the queries and too much focus on a premature data model.

Is this 5000 readings for a single sensor of a single sensor unit, or for
all sensors of a specified unit, or... both?

I presume you want "next" and "previous" 5000 readings as well as first
and last, but... you will have to confirm that.

One technique is to store the bulk of your raw sensor data in a separate
table and then simply store the PK of that data in your time series. That
way you can have a much wider row of time series (a larger number of rows)
without hitting a bulk size issue for the partition. But... I don't want
to jump to solutions until we have a firmer handle on the query side of
the fence.

-- Jack Krupansky
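[A minimal sketch of the split described above, under assumed names:
rawReadings, readingsByTime, and readingId are illustrative, not from the
thread. The 2K blobs live in their own table, and the time-series
partitions hold only small key references, so far more rows fit within the
same partition-size budget.]

    -- Bulk payloads, one small partition per reading (hypothetical sketch)
    create table rawReadings (
        readingId timeuuid,
        readings blob,
        primary key (readingId)
    );

    -- The time series stores only the reference, keeping partitions narrow
    create table readingsByTime (
        sensorUnitId int,
        sensorId int,
        timeShard int,
        time timestamp,
        readingId timeuuid,  -- PK of the bulk row in rawReadings
        primary key ((sensorUnitId, sensorId, timeShard), time)
    );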
On Fri, Mar 11, 2016 at 5:37 PM, Jason Kania wrote:

> Jack,
>
> Thanks for the response.
>
> We are targeting our database design to 10000 sensor units, and each
> sensor unit has 32 sensors. We are seeing about 700 events per day per
> sensor, each providing about 2K of data. Based on keeping each partition
> to about 10 MB (based on the performance readings we took), we chose to
> break our partitions on a weekly basis. This is possibly finer than we
> need, as we were seeing timeouts only once a single partition was about
> 150 MB in size.
>
> When pulling in data, we will typically need to pull 1 to 4 months of
> data for our analysis and will use only the sensorUnitId and sensorId to
> uniquely identify the data source, with the timeShard value used to break
> up our partitions. We have handling to sequentially scan based on our
> "timeShard" value, but we don't have a good handle on determining the
> "timeShard" portion of the partition key at read time. The data starts
> coming in when a subscriber starts using our system and finishes when
> they discontinue service or put the service on hold temporarily.
>
> When I talk about hotspots, it isn't the time series data that is the
> concern; it is storing the maximum and minimum timeShard values in
> another table for subsequent lookup, or the cost of running the current
> implementation of SELECT DISTINCT. We need to run queries such as getting
> the first or last 5000 sensor readings when we don't know the time frame
> in which they occurred, so we cannot directly supply the timeShard
> portion of our partition key.
>
> I appreciate your input,
>
> Thanks,
>
> Jason
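[For reference, the kind of min/max lookup table being debated here might
look like the sketch below; shardBounds and its column names are
hypothetical, and whether the steady stream of overwrites to these
single-row partitions is acceptable is exactly the open question in this
thread.]

    create table shardBounds (
        sensorUnitId int,
        sensorId int,
        minTimeShard int,
        maxTimeShard int,
        primary key ((sensorUnitId, sensorId))
    );

    -- Upsert on ingest; reads then bound the timeShard scan range.
    -- 201611 = year 2016, week 11, matching the timeShard convention.
    update shardBounds set maxTimeShard = 201611
        where sensorUnitId = 1 and sensorId = 7;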
> ------------------------------
> *From:* Jack Krupansky
> *To:* "user@cassandra.apache.org"
> *Sent:* Friday, March 11, 2016 4:45 PM
> *Subject:* Re: Strategy for dividing wide rows beyond just adding to the
> partition key
>
> I'll stay away from advising on a specific schema per se, but I'll stick
> to the advice that you need to make sure that your queries depend solely
> on the columns of the primary key or on relatively short slices/scans,
> rather than run the risk of very long scans or having to process multiple
> partitions for a single query. That's canned advice to some extent, but
> still essential.
>
> Of course we generally wish to avoid hotspots, but with time series they
> are unavoidable. Sure, you could place successive events in separate
> partitions, but then you can't do any kind of scanning/slicing.
>
> But events for separate sensors are not true hotspots in the traditional
> sense - unless you have only a single sensor/unit.
>
> After considering your queries, the next step is to consider the
> cardinality of your data - how many sensors, how many units, the rate of
> events, etc. That will feed back into the queries as well, such as how
> big a slice or scan might be, as well as the sizing of partitions.
>
> So, how many sensor units do you expect, how many sensors per unit, and
> what is the expected rate of events per sensor?
>
> Try not to jump too quickly to specific solutions - there really is a
> method to understanding all of this other stuff upfront.
>
> -- Jack Krupansky
>
> On Thu, Mar 10, 2016 at 12:39 PM, Jason Kania wrote:
>
> Jack,
>
> Thanks for the response. I don't think I provided enough information,
> and I used the wrong terminology, as your response reads more like the
> canned advice given in response to Cassandra antipatterns.
>
> To make this clearer, this is what we are doing:
>
> create table sensorReadings (
>     sensorUnitId int,
>     sensorId int,
>     time timestamp,
>     timeShard int,
>     readings blob,
>     primary key ((sensorUnitId, sensorId, timeShard), time)
> );
>
> where timeShard is a combination of year and week of year.
>
> For known time-range queries, this works great. However, the specific
> problem is in knowing the maximum and minimum timeShard values when we
> want to select the entire range of data. Our understanding is that if we
> update another related table with the maximum and minimum timeShard
> value for a given sensorUnitId and sensorId combination, we will create
> a hotspot and lots of tombstones. If we SELECT DISTINCT, we get a huge
> list of partition keys for the table because we cannot reduce the scope
> with a WHERE clause.
>
> If there is a recommended pattern that solves this, we haven't come
> across it.
>
> I hope this makes the problem clearer.
>
> Thanks,
>
> Jason
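[For context, this is the query shape that works against the schema above
when the shard is already known; the literal values are illustrative only:]

    select time, readings from sensorReadings
        where sensorUnitId = 1
          and sensorId = 7
          and timeShard = 201610
          and time >= '2016-03-07' and time < '2016-03-14';

[The partition key is fully specified and the range falls on the
clustering column, so the read touches a single partition. The thread's
problem is what to do when timeShard is unknown.]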
> ------------------------------
> *From:* Jack Krupansky
> *To:* user@cassandra.apache.org; Jason Kania
> *Sent:* Thursday, March 10, 2016 10:42 AM
> *Subject:* Re: Strategy for dividing wide rows beyond just adding to the
> partition key
>
> There is an effort underway to support wider rows:
> https://issues.apache.org/jira/browse/CASSANDRA-9754
>
> This won't help you now, though. Even with that improvement you may
> still need a more optimal data model, since large-scale
> scanning/filtering is always a very bad idea with Cassandra.
>
> The data modeling methodology for Cassandra dictates that queries drive
> the data model and that each form of query requires a separate table
> ("query table"). Materialized views can automate that process for a lot
> of cases, but in any case it does sound as if some of your queries do
> require additional tables.
>
> As a general proposition, Cassandra should not be used for heavy
> filtering - query tables with the filtering criteria baked into the PK
> are the way to go.
>
> -- Jack Krupansky
>
> On Thu, Mar 10, 2016 at 8:54 AM, Jason Kania wrote:
>
> Hi,
>
> We have sensor input that creates very wide rows, and operations on
> these rows have started to time out regularly. We have been trying to
> find a solution for dividing wide rows but keep hitting limitations that
> move the problem around instead of solving it.
>
> We have a partition key consisting of a sensorUnitId and a sensorId and
> use a time field to access each column in the row. We tried adding a
> time-based entry, timeShardId, to the partition key; it consists of the
> year and week of year during which the reading was taken. This works for
> a number of queries, but for scanning all the readings against a
> particular sensorUnitId and sensorId combination, we seem to be stuck.
>
> We won't know the range of valid values of the timeShardId for a given
> sensorUnitId and sensorId combination, so we would have to write to an
> additional table to track the valid timeShardId values. We suspect this
> would create tombstone accumulation problems, given the number of
> updates required to the same row, so we haven't tried this option.
>
> Alternatively, we hit a different bottleneck in the form of SELECT
> DISTINCT in trying to directly access the partition keys. Since SELECT
> DISTINCT does not allow a WHERE clause to filter on the partition key
> values, we have to filter several hundred thousand partition keys just
> to find those related to the relevant sensorUnitId and sensorId. This
> problem will only grow worse for us.
>
> Are there any other approaches that can be suggested? We have been
> looking around but haven't found any references beyond the initial
> suggestion to add some sort of shard id to the partition key to handle
> wide rows.
>
> Thanks,
>
> Jason
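[For reference, the SELECT DISTINCT bottleneck described in this thread
comes down to the statement below, a sketch only; timeShardId follows the
naming in this message, while the schema quoted earlier calls the column
timeShard.]

    -- Enumerates the partition key of every partition in the table; as
    -- discussed in this thread, it cannot be narrowed to one
    -- sensorUnitId/sensorId with a WHERE clause, so every key is scanned.
    select distinct sensorUnitId, sensorId, timeShardId from sensorReadings;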