cassandra-user mailing list archives

From Serega Sheypak <serega.shey...@gmail.com>
Subject Re: Timeseries analysis using Cassandra and partition by date period
Date Sat, 04 Apr 2015 17:38:27 GMT
>non-equal relation on a partition key is not supported
OK, can I generate a select query like this:
select some_attributes
from events where ymd = 20150101 or ymd = 20150102 or ymd = 20150103 ... or
ymd = 20150331
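
As far as I know, CQL's WHERE clause has no OR operator, so a query of this shape would probably have to become one equality query per day bucket (or an IN list). A minimal Python sketch of generating those statements follows. It assumes the proposed partition key (ymd, user_id), so each statement also restricts user_id; the helper names ymd_buckets and per_day_statements are made up for illustration and the statements are only built, not executed.

```python
from datetime import date, timedelta
from typing import List, Tuple

# Assumed statement shape for the proposed PRIMARY KEY ((ymd, user_id), event_ts).
QUERY = "SELECT some_attributes FROM events WHERE ymd = ? AND user_id = ?"


def ymd_buckets(first: date, last: date) -> List[int]:
    """Return every day from first to last (inclusive) as an int such as 20150101."""
    if last < first:
        return []
    days = (last - first).days + 1
    return [int((first + timedelta(days=i)).strftime("%Y%m%d")) for i in range(days)]


def per_day_statements(first: date, last: date, user_id: str) -> List[Tuple[str, Tuple[int, str]]]:
    """Pair the query text with bind values, one entry per day bucket."""
    return [(QUERY, (ymd, user_id)) for ymd in ymd_buckets(first, last)]


if __name__ == "__main__":
    # "user-42" is a placeholder id used only for this example.
    for text, params in per_day_statements(date(2015, 1, 1), date(2015, 1, 3), "user-42"):
        print(text, params)
```

Each generated statement targets exactly one partition, so a three-month window means one small query per day rather than a single range scan.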

> The partition key determines which node can satisfy the query
So you mean that all rows with the same *(ymd, user_id)* would be on one
physical node?
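
To make the question concrete, here is a toy Python sketch of how I understand placement: the whole composite partition key is hashed to a token, and the token decides which node owns the partition. This is only an illustration under loud assumptions: it uses MD5 and a modulo over a hypothetical four-node list, whereas a real cluster uses its configured partitioner, token ranges, and replication.

```python
import hashlib
from typing import Sequence

# Hypothetical node names, for illustration only.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def owner(ymd: int, user_id: str, nodes: Sequence[str] = NODES) -> str:
    """Map a composite partition key (ymd, user_id) to one node of a toy cluster."""
    key = "{}:{}".format(ymd, user_id).encode("utf-8")
    # Stand-in hash: first 8 bytes of MD5, not Cassandra's actual partitioner.
    token = int.from_bytes(hashlib.md5(key).digest()[:8], "big")
    return nodes[token % len(nodes)]


if __name__ == "__main__":
    # The same (ymd, user_id) pair always lands on the same node;
    # a different day for the same user is hashed independently.
    for ymd in (20150101, 20150102, 20150103):
        print(ymd, "user-42", owner(ymd, "user-42"))
```

In this model all rows sharing one (ymd, user_id) pair sit together, while the same user's other days may be spread over other nodes.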


2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:

> Unfortunately, a non-equal relation on a partition key is not supported.
> You would need to bucket by some larger unit, like a month, and then use
> the date/time as a clustering column for the row key. Then you could query
> within the partition. The partition key determines which node can satisfy
> the query. Designing your partition key judiciously is the key (haha!) to
> performant Cassandra applications.
>
> -- Jack Krupansky
>
> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <serega.sheypak@gmail.com>
> wrote:
>
>> Hi, we plan to have 10^8 users, and each user could generate 10 events per
>> day.
>> So we have:
>> 10^9 records per day
>> 10^9*30 records per month.
>> Our time window for analysis could be from 1 to 6 months.
>>
>> Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts is the
>> exact timestamp of the event.
>>
>> So you suggest this approach:
>> PRIMARY KEY ((ymd, user_id), event_ts)
>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>
>> where ymd = 20150102 (the second of January)?
>>
>> *What happens to writes:*
>> SSTables with past days (ymd < current_day) stay untouched and don't take
>> part in the compaction process, since there are no changes to them?
>>
>> What happens to reads:
>> I issue a query:
>> select some_attributes
>> from events where ymd >= 20150101 and ymd < 20150301
>> Does Cassandra skip SSTables that don't have ymd in the specified range and
>> give me a kind of partition elimination, like in traditional DBs?
>>
>>
>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupansky@gmail.com>:
>>
>>> It depends on the actual number of events per user, but simply bucketing
>>> the partition key can give you the same effect - clustering rows by time
>>> range. A composite partition key could be comprised of the user name and
>>> the date.
>>>
>>> It also depends on the data rate - is it many events per day or just a
>>> few events per week, or over what time period. You need to be careful - you
>>> don't want your Cassandra partitions to be too big (millions of rows) or
>>> too small (just a few or even one row per partition.)
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <serega.sheypak@gmail.com>
>>> wrote:
>>>
>>>> Hi, I switched from HBase to Cassandra and am trying to find a solution
>>>> for timeseries analysis on top of Cassandra.
>>>> I have an entity named "Event".
>>>> "Event" has attributes:
>>>> user_id - the user who triggered the event
>>>> event_ts - when the event happened
>>>> event_type - type of event
>>>> some_other_attr - some other attrs we don't care about right now.
>>>>
>>>> The DDL for the event entity looks like this:
>>>>
>>>> CREATE TABLE user_plans (
>>>>
>>>>   id timeuuid,
>>>>   user_id timeuuid,
>>>>   event_ts timestamp,
>>>>   event_type int,
>>>>   some_other_attr text,
>>>>
>>>>   PRIMARY KEY (user_id, event_ts)
>>>> );
>>>>
>>>> The table is "infinite": it would grow continuously during the application
>>>> lifetime.
>>>> I want to ask the question:
>>>> Cassandra, give me all events where event_ts >= xxx and event_ts <= yyy.
>>>>
>>>> Right now it would lead to a full table scan.
>>>>
>>>> There is a trick in HBase. HBase has a table abstraction and a
>>>> Column Family abstraction.
>>>> A column family should be declared in advance.
>>>> A column family is physically a pack of HFiles ("SSTables" in C*).
>>>> So I can easily add partitioning to my HBase table:
>>>> alter table hbase_events add column family '2015_01'
>>>> and store all January 2015 data in the column family named '2015_01'.
>>>>
>>>> When I want to get January data, I would directly access the column family
>>>> named '2015_01' and I won't touch all the data in the table, just this piece.
>>>>
>>>> What is the approach in C* in this case?
>>>> I have an idea to create several tables: event_2015_01, event_2015_02,
>>>> etc., but it looks rather ugly from my current understanding of how it works.
>>>>
>>>>
>>>>
>>>
>>
>
