Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2AE8517C73 for ; Sun, 5 Apr 2015 16:30:20 +0000 (UTC) Received: (qmail 89800 invoked by uid 500); 5 Apr 2015 16:30:16 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 89757 invoked by uid 500); 5 Apr 2015 16:30:16 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 89747 invoked by uid 99); 5 Apr 2015 16:30:16 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 05 Apr 2015 16:30:16 +0000 X-ASF-Spam-Status: No, hits=1.7 required=5.0 tests=FREEMAIL_ENVFROM_END_DIGIT,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of burtonator2011@gmail.com designates 209.85.214.173 as permitted sender) Received: from [209.85.214.173] (HELO mail-ob0-f173.google.com) (209.85.214.173) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 05 Apr 2015 16:30:11 +0000 Received: by obbfy7 with SMTP id fy7so15069922obb.2 for ; Sun, 05 Apr 2015 09:28:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:content-type; bh=1tbfZe8aAINPwT1VBqIXoDjjabbIuTU8mEkftbvXEhU=; b=zhR/BB9LBGN5qqzapqjVKUns1fsP2Z8aGv5CKXxzSbeRrA83ayUl77bSU85wkhTaQU Pppd0MeZSwgj5jJ/jTAm/rJX9zna+QNbB3qThyPfGTzxC4Z127s56ZwMqXUuV7h6Aq35 KuFm9wS7n9KJt748/+KeKhWigNG2Uv/KzIVFRCWaVSUV7fX6KP60HYNmpYmhAktKZEo4 aQ9z5pkh7iynu/jAhUE33BofquWzvfa1ECabnqPMTrvwMvbbk4c3OUtfiWfzFI09ZGfS yspHmGGcyxgraBGl7yHqjqvPBKrAIlb3FPC4eI9Zo83pWrQqJ+DC6+5rKK2IhQO3nGp6 VRdw== X-Received: by 10.182.20.195 with SMTP id p3mr13499813obe.1.1428251301231; Sun, 05 Apr 2015 09:28:21 -0700 (PDT) MIME-Version: 1.0 Sender: burtonator2011@gmail.com Received: by 10.183.6.227 with HTTP; Sun, 5 Apr 2015 09:28:01 -0700 (PDT) In-Reply-To: References: From: Kevin Burton Date: Sun, 5 Apr 2015 09:28:01 -0700 X-Google-Sender-Auth: aL7d1ZEa73_Pp_5j6BYZspw3wZY Message-ID: Subject: Re: Timeseries analysis using Cassandra and partition by date period To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=e89a8f50316c75c4120512fcaba6 X-Virus-Checked: Checked by ClamAV on apache.org --e89a8f50316c75c4120512fcaba6 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable > Hi, I switched from HBase to Cassandra and try to find problem solution for timeseries analysis on top Cassandra. Depending on what you=E2=80=99re looking for, you might want to check out K= airosDB. 0.95 beta2 just shipped yesterday as well so you have good timing. https://github.com/kairosdb/kairosdb On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak wrote: > Okay, so bucketing by day/week/month is a capacity planning stuff and > actual questions I want to ask. > As as a conclusion: > I have a table events > > CREATE TABLE user_plans ( > id timeuuid, > user_id timeuuid, > event_ts timestamp, > event_type int, > some_other_attr text > > PRIMARY KEY (user_id, ends) > ); > which fits tactic queries: > select smth from user_plans where user_id=3D'xxx' and end_ts > now() > > Then I create second table user_plans_daily (or weekly, monthy) > > with DDL: > CREATE TABLE user_plans_daily/weekly/monthly ( > ymd int, > user_id timeuuid, > event_ts timestamp, > event_type int, > some_other_attr text > ) > PRIMARY KEY ((ymd, user_id), event_ts ) > WITH CLUSTERING ORDER BY (event_ts DESC); > > And this table is good for answering strategic questions: > select * from > user_plans_daily/weekly/monthly > where ymd in (....) > And I should avoid long condition inside IN clause, that is why you > suggest me to create bigger bucket, correct? > > > 2015-04-04 20:00 GMT+02:00 Jack Krupansky : > >> It sounds like your time bucket should be a month, but it depends on the >> amount of data per user per day and your main query range. Within the >> partition you can then query for a range of days. >> >> Yes, all of the rows within a partition are stored on one physical node >> as well as the replica nodes. >> >> -- Jack Krupansky >> >> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak >> wrote: >> >>> >non-equal relation on a partition key is not supported >>> Ok, can I generate select query: >>> select some_attributes >>> from events where ymd =3D 20150101 or ymd =3D 20150102 or 20150103 ... = or >>> 20150331 >>> >>> > The partition key determines which node can satisfy the query >>> So you mean that all rows with the same *(ymd, user_id)* would be on >>> one physical node? >>> >>> >>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky : >>> >>>> Unfortunately, a non-equal relation on a partition key is not >>>> supported. You would need to bucket by some larger unit, like a month,= and >>>> then use the date/time as a clustering column for the row key. Then yo= u >>>> could query within the partition. The partition key determines which n= ode >>>> can satisfy the query. Designing your partition key judiciously is the= key >>>> (haha!) to performant Cassandra applications. >>>> >>>> -- Jack Krupansky >>>> >>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak < >>>> serega.sheypak@gmail.com> wrote: >>>> >>>>> Hi, we plan to have 10^8 users and each user could generate 10 events >>>>> per day. >>>>> So we have: >>>>> 10^8 records per day >>>>> 10^8*30 records per month. >>>>> Our timewindow analysis could be from 1 to 6 months. >>>>> >>>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts >>>>> of event. >>>>> >>>>> So you suggest this approach: >>>>> *PRIMARY KEY ((ymd, user_id), event_ts ) * >>>>> *WITH CLUSTERING ORDER BY (**event_ts* >>>>> * DESC);* >>>>> >>>>> where ymd=3D20150102 (the Second of January)? >>>>> >>>>> *What happens to writes:* >>>>> SSTable with past days (ymd < current_day) stay untouched and don't >>>>> take part in Compaction process since there are o changes to them? >>>>> >>>>> What happens to read: >>>>> I issue query: >>>>> select some_attributes >>>>> from events where ymd >=3D 20150101 and ymd < 20150301 >>>>> Does Cassandra skip SSTables which don't have ymd in specified range >>>>> and give me a kind of partition elimination, like in traditional DBs? >>>>> >>>>> >>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky : >>>>> >>>>>> It depends on the actual number of events per user, but simply >>>>>> bucketing the partition key can give you the same effect - clusterin= g rows >>>>>> by time range. A composite partition key could be comprised of the u= ser >>>>>> name and the date. >>>>>> >>>>>> It also depends on the data rate - is it many events per day or just >>>>>> a few events per week, or over what time period. You need to be care= ful - >>>>>> you don't want your Cassandra partitions to be too big (millions of = rows) >>>>>> or too small (just a few or even one row per partition.) >>>>>> >>>>>> -- Jack Krupansky >>>>>> >>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak < >>>>>> serega.sheypak@gmail.com> wrote: >>>>>> >>>>>>> Hi, I switched from HBase to Cassandra and try to find problem >>>>>>> solution for timeseries analysis on top Cassandra. >>>>>>> I have a entity named "Event". >>>>>>> "Event" has attributes: >>>>>>> user_id - a guy who triggered event >>>>>>> event_ts - when even happened >>>>>>> event_type - type of event >>>>>>> some_other_attr - some other attrs we don't care about right now. >>>>>>> >>>>>>> The DDL for entity event looks this way: >>>>>>> >>>>>>> CREATE TABLE user_plans ( >>>>>>> >>>>>>> id timeuuid, >>>>>>> user_id timeuuid, >>>>>>> event_ts timestamp, >>>>>>> event_type int, >>>>>>> some_other_attr text >>>>>>> >>>>>>> PRIMARY KEY (user_id, ends) >>>>>>> ); >>>>>>> >>>>>>> Table is "infinite", It would grow continuously during application >>>>>>> lifetime. >>>>>>> I want to ask question: >>>>>>> Cassandra, give me all event where event_ts >=3D xxx >>>>>>> and event_ts <=3Dyyy. >>>>>>> >>>>>>> Right now it would lead to full table scan. >>>>>>> >>>>>>> There is a trick in HBase. HBase has table abstraction and HBase ha= s >>>>>>> Column Family abstraction. >>>>>>> Column family should be declared in advance. >>>>>>> Column family - physically is a pack of HFiles ("SSTables in C*"). >>>>>>> So I can easily add partitioning for my HBase table: >>>>>>> alter table hbase_events add column familiy '2015_01' >>>>>>> and store all 2015 January data to Column familiy named '2015_01'. >>>>>>> >>>>>>> When I want to get January data, I would directly access column >>>>>>> family named '2015_01' and I won't massage all data in table, just = this >>>>>>> piece. >>>>>>> >>>>>>> What is approach in C* in this case? >>>>>>> I have an idea create several tables: event_2015_01, event_2015_02, >>>>>>> e.t.c. but it looks rather ugly from my current understanding how i= t works. >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > --=20 Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com =E2=80=A6 or check out my Google+ profile --e89a8f50316c75c4120512fcaba6 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
> Hi, I switched from HB= ase to Cassandra and try to find problem solution for timeseries analysis o= n top Cassandra.

<= /div>
Depending on what you=E2=80=99re looking for, you might want to c= heck out KairosDB.

0.95 beta2 just shipped yesterd= ay as well so you have good timing.


O= n Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <serega.sheypak@gmai= l.com> wrote:
Okay, so bucketing by day/week/month is a capacity planning stuff and a= ctual questions I want to ask.=C2=A0
As as a conclusion:
I ha= ve a table events

CREATE TABLE user_plans (
=C2=A0 id timeuuid,
=C2=A0 user_id timeuuid,
=C2=A0 event_ts timestamp,
=C2=A0 event_type int,
=C2=A0 some_other_attr text
=C2=A0=C2=A0
PRIMARY KEY (user_id, ends)
);
which fits tactic queries:=C2=A0
select smth from=C2=A0user_plans where user_id=3D'xxx' and end_ts > now()

Then I create second table= =C2=A0user_plans_daily = (or weekly, monthy)

with DDL:
CREATE TABLE user_plans_daily/weekly= /monthly (
=C2=A0 ymd int,=
=C2=A0 u= ser_id timeuuid,
=C2=A0 ev= ent_ts timestamp,
=C2=A0 e= vent_type int,
=C2=A0 some= _other_attr text
) =C2=A0
PRIMARY KEY ((ymd,=C2=A0user_id), event_ts )=C2=A0
WITH CLUSTERING ORDER BY (event_ts=C2=A0DESC);

And this table is good for answering strategic questions:=C2=A0<= /div>
select * from=C2=A0
<= div style=3D"font-size:12.8000001907349px">user_plans_daily/weekly/monthly
= where ymd in (....)
And I=C2=A0should=C2=A0avoid long condition inside IN clause, that i= s why you suggest me to create bigger bucket, correct?


2015-04-04 20:00 GMT+02:00 Jack K= rupansky <jack.krupansky@gmail.com>:
It sounds like your time bucket should b= e a month, but it depends on the amount of data per user per day and your m= ain query range. Within the partition you can then query for a range of day= s.

Yes, all of the rows within a partition are stored on= one physical node as well as the replica nodes.

-- Jack Krupansky

On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheyp= ak <serega.sheypak@gmail.com> wrote:
>non-equal relation on a partition key is not supported
Ok, can I generat= e select query:
select some_attributes=C2=A0=
from events where ymd =3D 20150101 or ymd =3D=C2= =A020150102 or=C2=A020150103 ... or=C2=A020150331
<= span>

>=C2=A0The partition key determines which nod= e can satisfy the query
So you mean that a= ll rows with the same=C2=A0(ymd,=C2=A0user_id)= =C2=A0would be on one physical node?

<= /div>

2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupansky@gmail= .com>:
Unf= ortunately, a non-equal relation on a partition key is not supported. You w= ould need to bucket by some larger unit, like a month, and then use the dat= e/time as a clustering column for the row key. Then you could query within = the partition. The partition key determines which node can satisfy the quer= y. Designing your partition key judiciously is the key (haha!) to performan= t Cassandra applications.

-- Jack Krupansky=

On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheyp= ak <serega.sheypak@gmail.com> wrote:
Hi, we plan to have 10^8 users and each = user could generate 10 events per day.
So we have:=C2=A0
10^8 records per day
10^8*30 records per month.=C2=A0
= Our timewindow analysis could be from 1 to 6 months.

Right now PK is=C2=A0PRIMAR= Y KEY (user_id, ends) where endts is exact ts of event.

So you suggest this approach:
PRIMARY KEY ((ymd,=C2=A0user_id), event_ts )=C2=A0
WITH CLUSTERING ORDE= R BY (event_ts=C2=A0DESC);

where ymd= =3D20150102 (the Second of January)?

= What happens to writes:
SSTable with past days (ymd < current_day) stay untouched= and don't take part in Compaction process since there are o changes to= them?

What happens t= o read:
I iss= ue query:=C2=A0
select some_attributes=C2=A0
from events where ymd >=3D 20150101 and ymd < 2015= 0301
Does Cas= sandra skip SSTables which don't have ymd in specified range and give m= e a kind of partition elimination, like in traditional DBs?


2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.kr= upansky@gmail.com>:
It depends on the actual number of events per user, but simply bu= cketing the partition key can give you the same effect - clustering rows by= time range. A composite partition key could be comprised of the user name = and the date.

It also depends on the data rate - is it m= any events per day or just a few events per week, or over what time period.= You need to be careful - you don't want your Cassandra partitions to b= e too big (millions of rows) or too small (just a few or even one row per p= artition.)

-- Jack Krupansky

On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheyp= ak <serega.sheypak@gmail.com> wrote:
Hi, I switched from HBase to Cassandra a= nd try to find problem solution for timeseries analysis on top Cassandra.I have a entity named "Event".
"Event" has= attributes:
user_id - a guy who triggered event
event_= ts - when even happened
event_type - type of event
some= _other_attr - some other attrs we don't care about right now.

The DDL for entity event looks this way:

CREATE TABLE user_plans (
=C2=A0
=C2=A0 id = timeuuid,
=C2=A0 user_id timeuuid,
=C2=A0 event_ts time= stamp,
=C2=A0 event_type int,
=C2=A0 some_other_attr te= xt
=C2=A0=C2=A0
PRIMARY KEY (user_id, ends)
);

Table is "infinite", It woul= d grow continuously during application lifetime.
I want to ask qu= estion:
Cassandra, give me all event where event_ts >=3D xxx a= nd=C2=A0event_ts=C2=A0<=3Dyyy.

Right now it wou= ld lead to full table scan.

There is a trick in HB= ase. HBase has table abstraction and HBase has Column Family abstraction.= =C2=A0
Column family should be declared in advance.=C2=A0
Column family - physically is a pack of HFiles ("SSTables in C*&quo= t;).
So I can easily add partitioning for my HBase table:
alter table hbase_events add column familiy '2015_01'=C2=A0
and store all 2015 January data to Column familiy named '2015_01&= #39;.

When I want to get January data, I would dir= ectly access column family named '2015_01' and I won't massage = all data in table, just this piece.

What is approa= ch in C* in this case?
I have an idea create several tables: even= t_2015_01, event_2015_02, e.t.c. but it looks rather ugly from my current u= nderstanding how it works.











--
= --e89a8f50316c75c4120512fcaba6--