From: Eric Stevens
Date: Sat, 7 Mar 2015 15:19:41 -0700
Subject: Re: best practices for time-series data with massive amounts of records
To: user@cassandra.apache.org

It's probably quite rare for queries against extremely large time series data to touch the whole set of data. Instead there's almost always a "between X and Y dates" aspect to nearly every real-time query you might run against a table like this (with the exception of "most recent N events").

Because of this, time bucketing can be an effective strategy, though until you understand your data better, it's hard to know how large (or small) to make your buckets. Because of *that*, I recommend using the timestamp data type for your bucketing strategy - this gives you the advantage of being able to reduce your bucket sizes later while keeping your at-rest data mostly still quite accessible.
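For concreteness, here's a minimal sketch of a table bucketed that way (table and column names are just illustrative, not from any schema you've described):

    CREATE TABLE user_events (
        user_id    text,
        bucket     timestamp,  -- event time floored to the bucket size (day, hour, ...)
        event_time timeuuid,
        payload    text,
        PRIMARY KEY ((user_id, bucket), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

    -- a "between X and Y dates" query then touches one partition per bucket
    -- in the range, e.g. for the 2015-03-07 day bucket:
    SELECT * FROM user_events
     WHERE user_id = 'u123'
       AND bucket = '2015-03-07'
       AND event_time > maxTimeuuid('2015-03-07 06:00+0000')
       AND event_time < minTimeuuid('2015-03-07 18:00+0000');

The client just iterates that query over each bucket covered by [X, Y).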
What I mean is that if you change your bucketing strategy from day to hour, then when you query across the changed time period you can iterate at the finer granularity (hour) buckets, and you'll pick up the coarser granularity (day) buckets automatically for all but the earliest bucket (which is easy to correct for when you're flooring your start bucket). In the coarser time period most reads are partition key misses, which are extremely inexpensive in Cassandra.

If you do need most-recent-N queries over broad ranges, and you expect some users whose click rate is dramatically less frequent than your bucket interval (making iterating over buckets inefficient), you can keep a separate counter table with a PK of ((user_id), bucket) in which you count new events. Now you can identify the exact set of buckets you need to read to satisfy the query no matter what the user's click volume is (so very low volume users have at most N partition keys queried, and higher volume users query fewer partition keys). A rough sketch of that counter table follows the quoted messages below.

On Fri, Mar 6, 2015 at 4:06 PM, graham sanderson <graham@vast.com> wrote:

> Note that using static column(s) for the "head" value, and trailing TTLed
> values behind, is something we're considering. Note this is especially nice
> if your head state includes say a map which is updated by small deltas
> (individual keys)
>
> We have not yet studied the effect of static columns on say DTCS
>
>
> On Mar 6, 2015, at 4:42 PM, Clint Kelly <clint.kelly@gmail.com> wrote:
>
> Hi all,
>
> Thanks for the responses, this was very helpful.
>
> I don't know yet what the distribution of clicks and users will be, but I
> expect to see a few users with an enormous amount of interactions and most
> users having very few. The idea of doing some additional manual
> partitioning, and then maintaining another table that contains the "head"
> partition for each user makes sense, although it would add additional
> latency when we want to get say the most recent 1000 interactions for a
> given user (which is something that we have to do sometimes for
> applications with tight SLAs).
>
> FWIW I doubt that any users will have so many interactions that they
> exceed what we could reasonably put in a row, but I wanted to have a
> strategy to deal with this.
>
> Having a nice design pattern in Cassandra for maintaining a row with the
> N-most-recent interactions would also solve this reasonably well, but I
> don't know of any way to implement that without running batch jobs that
> periodically clean out data (which might be okay).
>
> Best regards,
> Clint
>
>
> On Tue, Mar 3, 2015 at 8:10 AM, mck <mck@apache.org> wrote:
>
>> > Here "partition" is a random digit from 0 to (N*M)
>> > where N=nodes in cluster, and M=arbitrary number.
>>
>> Hopefully it was obvious, but here (unless you've got hot partitions),
>> you don't need N.
>> ~mck
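Here's the counter-table sketch I referred to above (again, names are illustrative; note that every non-primary-key column of a counter table has to be a counter):

    CREATE TABLE user_event_counts (
        user_id text,
        bucket  timestamp,
        events  counter,
        PRIMARY KEY ((user_id), bucket)
    );

    -- bump the matching bucket's counter whenever you write an event:
    UPDATE user_event_counts SET events = events + 1
     WHERE user_id = 'u123' AND bucket = '2015-03-07';

    -- for most-recent-N, read the newest buckets first and stop once the
    -- running total of counts reaches N; only those buckets then need to be
    -- queried in the events table:
    SELECT bucket, events FROM user_event_counts
     WHERE user_id = 'u123'
     ORDER BY bucket DESC
     LIMIT 20;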