cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kevin Burton <bur...@spinn3r.com>
Subject Re: Timeseries analysis using Cassandra and partition by date period
Date Sun, 05 Apr 2015 16:28:01 GMT
> Hi, I switched from HBase to Cassandra and try to find problem solution
for timeseries analysis on top Cassandra.

Depending on what you’re looking for, you might want to check out KairosDB.

0.95 beta2 just shipped yesterday as well so you have good timing.

https://github.com/kairosdb/kairosdb

On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <serega.sheypak@gmail.com>
wrote:

> Okay, so bucketing by day/week/month is a capacity planning stuff and
> actual questions I want to ask.
> As as a conclusion:
> I have a table events
>
> CREATE TABLE user_plans (
>   id timeuuid,
>   user_id timeuuid,
>   event_ts timestamp,
>   event_type int,
>   some_other_attr text
>
> PRIMARY KEY (user_id, ends)
> );
> which fits tactic queries:
> select smth from user_plans where user_id='xxx' and end_ts > now()
>
> Then I create second table user_plans_daily (or weekly, monthy)
>
> with DDL:
> CREATE TABLE user_plans_daily/weekly/monthly (
>   ymd int,
>   user_id timeuuid,
>   event_ts timestamp,
>   event_type int,
>   some_other_attr text
> )
> PRIMARY KEY ((ymd, user_id), event_ts )
> WITH CLUSTERING ORDER BY (event_ts DESC);
>
> And this table is good for answering strategic questions:
> select * from
> user_plans_daily/weekly/monthly
> where ymd in (....)
> And I should avoid long condition inside IN clause, that is why you
> suggest me to create bigger bucket, correct?
>
>
> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:
>
>> It sounds like your time bucket should be a month, but it depends on the
>> amount of data per user per day and your main query range. Within the
>> partition you can then query for a range of days.
>>
>> Yes, all of the rows within a partition are stored on one physical node
>> as well as the replica nodes.
>>
>> -- Jack Krupansky
>>
>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <serega.sheypak@gmail.com>
>> wrote:
>>
>>> >non-equal relation on a partition key is not supported
>>> Ok, can I generate select query:
>>> select some_attributes
>>> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or
>>> 20150331
>>>
>>> > The partition key determines which node can satisfy the query
>>> So you mean that all rows with the same *(ymd, user_id)* would be on
>>> one physical node?
>>>
>>>
>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:
>>>
>>>> Unfortunately, a non-equal relation on a partition key is not
>>>> supported. You would need to bucket by some larger unit, like a month, and
>>>> then use the date/time as a clustering column for the row key. Then you
>>>> could query within the partition. The partition key determines which node
>>>> can satisfy the query. Designing your partition key judiciously is the key
>>>> (haha!) to performant Cassandra applications.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>>>> serega.sheypak@gmail.com> wrote:
>>>>
>>>>> Hi, we plan to have 10^8 users and each user could generate 10 events
>>>>> per day.
>>>>> So we have:
>>>>> 10^8 records per day
>>>>> 10^8*30 records per month.
>>>>> Our timewindow analysis could be from 1 to 6 months.
>>>>>
>>>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts
>>>>> of event.
>>>>>
>>>>> So you suggest this approach:
>>>>> *PRIMARY KEY ((ymd, user_id), event_ts ) *
>>>>> *WITH CLUSTERING ORDER BY (**event_ts*
>>>>> * DESC);*
>>>>>
>>>>> where ymd=20150102 (the Second of January)?
>>>>>
>>>>> *What happens to writes:*
>>>>> SSTable with past days (ymd < current_day) stay untouched and don't
>>>>> take part in Compaction process since there are o changes to them?
>>>>>
>>>>> What happens to read:
>>>>> I issue query:
>>>>> select some_attributes
>>>>> from events where ymd >= 20150101 and ymd < 20150301
>>>>> Does Cassandra skip SSTables which don't have ymd in specified range
>>>>> and give me a kind of partition elimination, like in traditional DBs?
>>>>>
>>>>>
>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:
>>>>>
>>>>>> It depends on the actual number of events per user, but simply
>>>>>> bucketing the partition key can give you the same effect - clustering
rows
>>>>>> by time range. A composite partition key could be comprised of the
user
>>>>>> name and the date.
>>>>>>
>>>>>> It also depends on the data rate - is it many events per day or just
>>>>>> a few events per week, or over what time period. You need to be careful
-
>>>>>> you don't want your Cassandra partitions to be too big (millions
of rows)
>>>>>> or too small (just a few or even one row per partition.)
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>>>>> serega.sheypak@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, I switched from HBase to Cassandra and try to find problem
>>>>>>> solution for timeseries analysis on top Cassandra.
>>>>>>> I have a entity named "Event".
>>>>>>> "Event" has attributes:
>>>>>>> user_id - a guy who triggered event
>>>>>>> event_ts - when even happened
>>>>>>> event_type - type of event
>>>>>>> some_other_attr - some other attrs we don't care about right
now.
>>>>>>>
>>>>>>> The DDL for entity event looks this way:
>>>>>>>
>>>>>>> CREATE TABLE user_plans (
>>>>>>>
>>>>>>>   id timeuuid,
>>>>>>>   user_id timeuuid,
>>>>>>>   event_ts timestamp,
>>>>>>>   event_type int,
>>>>>>>   some_other_attr text
>>>>>>>
>>>>>>> PRIMARY KEY (user_id, ends)
>>>>>>> );
>>>>>>>
>>>>>>> Table is "infinite", It would grow continuously during application
>>>>>>> lifetime.
>>>>>>> I want to ask question:
>>>>>>> Cassandra, give me all event where event_ts >= xxx
>>>>>>> and event_ts <=yyy.
>>>>>>>
>>>>>>> Right now it would lead to full table scan.
>>>>>>>
>>>>>>> There is a trick in HBase. HBase has table abstraction and HBase
has
>>>>>>> Column Family abstraction.
>>>>>>> Column family should be declared in advance.
>>>>>>> Column family - physically is a pack of HFiles ("SSTables in
C*").
>>>>>>> So I can easily add partitioning for my HBase table:
>>>>>>> alter table hbase_events add column familiy '2015_01'
>>>>>>> and store all 2015 January data to Column familiy named '2015_01'.
>>>>>>>
>>>>>>> When I want to get January data, I would directly access column
>>>>>>> family named '2015_01' and I won't massage all data in table,
just this
>>>>>>> piece.
>>>>>>>
>>>>>>> What is approach in C* in this case?
>>>>>>> I have an idea create several tables: event_2015_01, event_2015_02,
>>>>>>> e.t.c. but it looks rather ugly from my current understanding
how it works.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

Mime
View raw message