Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of burtonator2011@gmail.com
 designates 209.85.214.173 as permitted sender)
MIME-Version: 1.0
Sender: burtonator2011@gmail.com
In-Reply-To: 
 <CABNOTEX5HH8CsBdDRkdGFmK9BuLkMf0d2fD2qH4Si4GzO_xoPQ@mail.gmail.com>
References: 
 <CABNOTEVibMGfWaTu7G6k4zhrkP+S8fO7Yk2db8FT=4wXbOXVfQ@mail.gmail.com>
 <CAOxAL61t7FqDqGg=5s99W3ei=i_RkE4aO9+eH0F0o5XO54F5mQ@mail.gmail.com>
 <CABNOTEXKDMTnGPsj7jGFqKWxMbVZUfX9r81PwUzeHYH1Txq8yQ@mail.gmail.com>
 <CAOxAL63G1Lka-0RBeK9YDwcFVeNATBMZb+pVx6QsmwQ30Sq9jw@mail.gmail.com>
 <CABNOTEUVG-xeg0Tf7r5YUbO8oTvQaoWBPAT6U2FBNDYFthj6KA@mail.gmail.com>
 <CAOxAL63_vDV09JxLFzwhF32MEqH_4KHq39P17bgbYaJSOjXobA@mail.gmail.com>
 <CABNOTEX5HH8CsBdDRkdGFmK9BuLkMf0d2fD2qH4Si4GzO_xoPQ@mail.gmail.com>
From: Kevin Burton <burton@spinn3r.com>
Date: Sun, 5 Apr 2015 09:28:01 -0700
Message-ID: 
 <CAAZU44m2nYMT0UN6cqGj1UxOt3mbSh2Qu=7=HVPusHsymChQPw@mail.gmail.com>
Subject: Re: Timeseries analysis using Cassandra and partition by date period
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=e89a8f50316c75c4120512fcaba6

--e89a8f50316c75c4120512fcaba6
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

> Hi, I switched from HBase to Cassandra and am trying to find a solution
> for timeseries analysis on top of Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.

https://github.com/kairosdb/kairosdb

On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <serega.sheypak@gmail.com>
wrote:

> Okay, so the choice between bucketing by day, week, or month depends on
> capacity planning and on the actual questions I want to ask.
> In conclusion:
> I have an events table:
>
> CREATE TABLE user_plans (
>   id timeuuid,
>   user_id timeuuid,
>   event_ts timestamp,
>   event_type int,
>   some_other_attr text,
>   PRIMARY KEY (user_id, event_ts)
> );
> which fits tactical queries:
> select smth from user_plans where user_id='xxx' and event_ts > now()
>
> Then I create a second table, user_plans_daily (or weekly, monthly)
>
> with DDL:
> CREATE TABLE user_plans_daily/weekly/monthly (
>   ymd int,
>   user_id timeuuid,
>   event_ts timestamp,
>   event_type int,
>   some_other_attr text,
>   PRIMARY KEY ((ymd, user_id), event_ts)
> )
> WITH CLUSTERING ORDER BY (event_ts DESC);
>
> And this table is good for answering strategic questions:
> select * from
> user_plans_daily/weekly/monthly
> where ymd in (....)
> And I should avoid a long list of values inside the IN clause, which is
> why you suggest that I create a bigger bucket, correct?
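
To make the IN-clause point concrete, here is a minimal Python sketch (an
illustration only, not something from the thread) that lists the bucket
values a three-month window needs with daily versus monthly buckets. The
helper names day_buckets and month_buckets and the yyyymm encoding are
assumptions; the thread itself only shows the yyyymmdd form.

```python
from datetime import date, timedelta
from typing import List


def day_buckets(start: date, end: date) -> List[int]:
    """yyyymmdd bucket ids for every day from start to end, inclusive."""
    buckets = []
    current = start
    while current <= end:
        buckets.append(current.year * 10000 + current.month * 100 + current.day)
        current += timedelta(days=1)
    return buckets


def month_buckets(start: date, end: date) -> List[int]:
    """yyyymm bucket ids for every month touched by start..end, inclusive."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


window = (date(2015, 1, 1), date(2015, 3, 31))
print(len(day_buckets(*window)))   # 90 values for the IN list
print(month_buckets(*window))      # [201501, 201502, 201503]
```

Keep in mind that with a composite partition key of (ymd, user_id), a query
normally has to restrict user_id as well, so each value in the list still
maps to a separate partition read.
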
>
>
> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:
>
>> It sounds like your time bucket should be a month, but it depends on the
>> amount of data per user per day and your main query range. Within the
>> partition you can then query for a range of days.
>>
>> Yes, all of the rows within a partition are stored on one physical node
>> as well as the replica nodes.
>>
>> -- Jack Krupansky
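
A sketch of what "query for a range of days within the partition" could look
like from the client side, assuming monthly buckets stored in a hypothetical
int column ym (yyyymm) on the user_plans_monthly table: split the requested
window at month boundaries and issue one query per piece. The function name
split_by_month and the CQL text are illustrative assumptions, not code from
the thread.

```python
from datetime import date
from typing import List, Tuple


def split_by_month(start: date, end: date) -> List[Tuple[int, date, date]]:
    """Split the half-open range [start, end) into per-month pieces.

    Each piece is (yyyymm bucket, slice_start, slice_end), with slice_end
    exclusive, so it maps onto one partition plus a clustering-column range.
    """
    pieces = []
    cursor = start
    while cursor < end:
        # First day of the month after the cursor.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        slice_end = min(next_month, end)
        pieces.append((cursor.year * 100 + cursor.month, cursor, slice_end))
        cursor = slice_end
    return pieces


# Hypothetical statement; ym is an assumed column name, bound per piece.
CQL = (
    "SELECT event_ts, event_type FROM user_plans_monthly "
    "WHERE ym = ? AND user_id = ? AND event_ts >= ? AND event_ts < ?"
)

for ym, lo, hi in split_by_month(date(2015, 1, 15), date(2015, 3, 10)):
    print(ym, lo, hi)
```

Each piece touches exactly one (ym, user_id) partition, and the event_ts
bounds become a range on the clustering column inside it.
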
>>
>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <serega.sheypak@gmail.com>
>> wrote:
>>
>>> >non-equal relation on a partition key is not supported
>>> Ok, can I generate a select query like this:
>>> select some_attributes
>>> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or
>>> 20150331
>>>
>>> > The partition key determines which node can satisfy the query
>>> So you mean that all rows with the same *(ymd, user_id)* would be on
>>> one physical node?
>>>
>>>
>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:
>>>
>>>> Unfortunately, a non-equal relation on a partition key is not
>>>> supported. You would need to bucket by some larger unit, like a month,
>>>> and then use the date/time as a clustering column for the row key. Then
>>>> you could query within the partition. The partition key determines which
>>>> node can satisfy the query. Designing your partition key judiciously is
>>>> the key (haha!) to performant Cassandra applications.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com> wrote:
>>>>
>>>>> Hi, we plan to have 10^8 users and each user could generate 10 events
>>>>> per day.
>>>>> So we have:
>>>>> 10^9 records per day
>>>>> 10^9*30 records per month.
>>>>> Our time window for analysis could be from 1 to 6 months.
>>>>>
>>>>> Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts is
>>>>> the exact timestamp of the event.
>>>>>
>>>>> So you suggest this approach:
>>>>> PRIMARY KEY ((ymd, user_id), event_ts)
>>>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>>>
>>>>> where ymd=20150102 (the second of January)?
>>>>>
>>>>> What happens to writes:
>>>>> Do SSTables for past days (ymd < current_day) stay untouched and not
>>>>> take part in the compaction process, since there are no changes to them?
>>>>>
>>>>> What happens to reads:
>>>>> I issue the query:
>>>>> select some_attributes
>>>>> from events where ymd >= 20150101 and ymd < 20150301
>>>>> Does Cassandra skip SSTables which don't have ymd in the specified range
>>>>> and give me a kind of partition elimination, like in traditional DBs?
>>>>>
>>>>>
>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:
>>>>>
>>>>>> It depends on the actual number of events per user, but simply
>>>>>> bucketing the partition key can give you the same effect - clustering
>>>>>> rows by time range. A composite partition key could be comprised of the
>>>>>> user name and the date.
>>>>>>
>>>>>> It also depends on the data rate - is it many events per day or just
>>>>>> a few events per week, or over what time period. You need to be careful -
>>>>>> you don't want your Cassandra partitions to be too big (millions of rows)
>>>>>> or too small (just a few or even one row per partition.)
>>>>>>
>>>>>> -- Jack Krupansky
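
As a rough check on the "too big or too small" advice, here is a
back-of-the-envelope Python sketch of rows per partition for a
((bucket, user_id), event_ts) key, using the 10 events per user per day
figure from the thread. The bucket lengths (7 and 30 days) and the function
name rows_per_partition are assumptions made for illustration.

```python
# Figure quoted in the thread: each user could generate 10 events per day.
EVENTS_PER_USER_PER_DAY = 10

# Approximate bucket lengths in days (assumed values, a month taken as 30).
BUCKET_DAYS = {"day": 1, "week": 7, "month": 30}


def rows_per_partition(bucket: str,
                       events_per_day: int = EVENTS_PER_USER_PER_DAY) -> int:
    """Estimated rows in one (bucket, user_id) partition."""
    return BUCKET_DAYS[bucket] * events_per_day


for bucket in BUCKET_DAYS:
    print(bucket, rows_per_partition(bucket))
```

Under these assumptions a daily bucket holds about 10 rows per partition and
a monthly bucket about 300, which is far from the millions-of-rows range and
needs far fewer partition reads for a multi-month window.
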
>>>>>>
>>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>>>>> serega.sheypak@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I switched from HBase to Cassandra and am trying to find a
>>>>>>> solution for timeseries analysis on top of Cassandra.
>>>>>>> I have an entity named "Event".
>>>>>>> "Event" has attributes:
>>>>>>> user_id - the user who triggered the event
>>>>>>> event_ts - when the event happened
>>>>>>> event_type - type of event
>>>>>>> some_other_attr - some other attrs we don't care about right now.
>>>>>>>
>>>>>>> The DDL for the event entity looks like this:
>>>>>>>
>>>>>>> CREATE TABLE user_plans (
>>>>>>>   id timeuuid,
>>>>>>>   user_id timeuuid,
>>>>>>>   event_ts timestamp,
>>>>>>>   event_type int,
>>>>>>>   some_other_attr text,
>>>>>>>   PRIMARY KEY (user_id, event_ts)
>>>>>>> );
>>>>>>>
>>>>>>> The table is "infinite": it would grow continuously during the
>>>>>>> application lifetime.
>>>>>>> I want to ask this question:
>>>>>>> Cassandra, give me all events where event_ts >= xxx
>>>>>>> and event_ts <= yyy.
>>>>>>>
>>>>>>> Right now it would lead to a full table scan.
>>>>>>>
>>>>>>> There is a trick in HBase. HBase has a table abstraction and a
>>>>>>> Column Family abstraction.
>>>>>>> A column family should be declared in advance.
>>>>>>> A column family is physically a pack of HFiles ("SSTables in C*").
>>>>>>> So I can easily add partitioning for my HBase table:
>>>>>>> alter table hbase_events add column family '2015_01'
>>>>>>> and store all January 2015 data in the column family named '2015_01'.
>>>>>>>
>>>>>>> When I want to get January data, I would directly access the column
>>>>>>> family named '2015_01' and I won't touch all the data in the table,
>>>>>>> just this piece.
>>>>>>>
>>>>>>> What is the approach in C* in this case?
>>>>>>> I have an idea to create several tables: event_2015_01, event_2015_02,
>>>>>>> etc., but it looks rather ugly from my current understanding of how it
>>>>>>> works.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

--e89a8f50316c75c4120512fcaba6--