incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adriano Paggi <adri...@pert.com.ar>
Subject RE: CQL3 Data Model Question
Date Wed, 08 May 2013 12:19:58 GMT
At those rates, to avoid hot spots, and maintain rows in a manageable size (~10MB or so), maybe
you should partition every 5 or 10 minutes per app_id, so your partition key looks like <app_id
+ YYYYMMDDHHmm > i.e 1234_201305081300, 1234_201305081305, 1234_201305081310, 1234_201305081315,
etc.

Keep in mind that, when you say

>>select * from organic_events where app_id = 1234 and event_time <=
>>'2012-01-01' and event_time > '2012-01-01';

That's, 50.000 * 60 * 60 * 24 * <size of your record> => a lot of bytes!!
You need to use LIMIT on that.

Regards,
Adriano.

-----Original Message-----
From: Hiller, Dean [mailto:Dean.Hiller@nrel.gov] 
Sent: Martes, 07 de Mayo de 2013 05:52 p.m.
To: user@cassandra.apache.org
Subject: Re: CQL3 Data Model Question

Playorm is not yet on CQL3 and cassandra doesn't work well with +10,000
CF's as we went down that path and cassandra can't cope, so we have one
cassandra CF with 60,000 virtual CF's thanks to PlayOrm and a few other
CF's.

But yes, we bucket into hour or month or whatever depending on your rates
and have an exact timestamp as well.  That is one option.  You can
virtualize without playorm by just prefixing the rowkey with the device id
each time and reversing that on reads of course.  I am not sure if you
need to partition by time or not after that as that dependson number of
rows per device.

Later,
Dean

On 5/7/13 2:42 PM, "Keith Wright" <kwright@nanigans.com> wrote:

>So in that case I would create a different column family for each app id
>and then a "time bucket" key as the row key with perhaps an hour
>resolution?  Something like this:
>
>create 123_table organic_events (
>   hour timestamp,
>   event_id UUID,
>   app_id INT,
>   event_time TIMESTAMP,
>   user_id INT,
>   Š.
>   PRIMARY KEY (hour, event_time, event_id)
>)  WITH CLUSTERING ORDER BY (event_time desc);
>
>
>
>Is this what others are doing?
>
>
>On 5/7/13 4:18 PM, "Hiller, Dean" <Dean.Hiller@nrel.gov> wrote:
>
>>We use PlayOrm to do 60,000 different streams which are all time series
>>and use the virtual column families of PlayOrm so they are all in one
>>column family.  We then partition by time as well.  I don't believe that
>>we really have any hotspots from what I can tell.
>>
>>Dean
>>
>>From: Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
>>Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
>><user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
>>Date: Tuesday, May 7, 2013 2:02 PM
>>To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>"
>><user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
>>Subject: CQL3 Data Model Question
>>
>>Hi all,
>>
>>    I was hoping you could provide some assistance with a data modeling
>>question (my apologies if a similar question has already been posed).  I
>>have time based data that I need to store on a per customer (aka app id )
>>basis so that I can easily return it in sorted order by event time.  The
>>data in question is being written at high volume (~50K / sec) and I am
>>concerned about the cardinality of using either app id or event time as
>>the row key as either will likely result in hot spots.  Here are is the
>>table definition I am considering:
>>
>>create table organic_events (
>>event_id UUID,
>>app_id INT,
>>event_time TIMESTAMP,
>>user_id INT,
>>Š.
>>PRIMARY KEY (app_id, event_time, event_id)
>>)  WITH CLUSTERING ORDER BY (app_id asc,event_time desc);
>>
>>So that I can be able to query as follows which will naturally sort the
>>results by time descending:
>>
>>select * from organic_events where app_id = 1234 and event_time <=
>>'2012-01-01' and event_time > '2012-01-01';
>>
>>Anyone have an idea of the best way to accomplish this?  I was
>>considering the following:
>>
>> *   Making the row key a concatenation of app id and 0-100 using a mod
>>on event id to get the value.  When getting data I would just fetch all
>>keys given the mods (app_id in (1234_0,1234_1,1234_2, etc).  This would
>>alleviate the "hot" key issue but still seems expensive and a little
>>hacky
>> *   I tried removing app_id from the primary key all together (using
>>primary key of user_id, event_time, event_id) and making app_id a
>>secondary index.  I would need to sort by time on the client.  The above
>>query is valid however running a query is VERY slow as I believe it needs
>>to fetch every row key that matches the index which is quite expensive (I
>>get a timeout in cqlsh).
>> *   Create a different column family for each app id (I.e.
>>1234_organic_events).  Note that we could easily have 1000s of
>>application ids.
>>
>>Thanks!
>


Mime
View raw message