I was hoping you could provide some assistance with a data modeling question (my apologies if a similar question has already been posed). I have time-based data that I need to store on a per-customer (aka app id) basis so that I can easily return it in sorted order by event time. The data in question is being written at high volume (~50K writes/sec), and I am concerned that using either app id or event time as the row key will likely result in hot spots. Here is the table definition I am considering:
create table organic_events (
    app_id bigint,
    event_time timestamp,
    event_id bigint,
    PRIMARY KEY (app_id, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time desc);
This lets me query as follows, which naturally sorts the results by time descending:
select * from organic_events where app_id = 1234 and event_time > '2012-01-01' and event_time <= '2012-02-01';
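To make the "naturally sorted" point concrete, here is a minimal in-memory Python sketch of what the clustering order gives you server-side; the data and names are illustrative, not from a real cluster:

```python
from datetime import datetime

# Illustrative stand-in for one app_id partition: with CLUSTERING ORDER BY
# (event_time DESC), Cassandra stores rows newest-first, so a range slice
# comes back already sorted and needs no client-side sort.
partition = [
    (datetime(2012, 1, 15), "e3"),
    (datetime(2012, 1, 10), "e2"),
    (datetime(2012, 1, 5), "e1"),
]  # already ordered newest-first, as the clustering order guarantees

def slice_desc(rows, start_exclusive, end_inclusive):
    """Return rows with start < event_time <= end, preserving DESC order."""
    return [r for r in rows if start_exclusive < r[0] <= end_inclusive]

result = slice_desc(partition, datetime(2012, 1, 1), datetime(2012, 1, 12))
# result keeps only e2 and e1, still newest-first
```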
Does anyone have an idea of the best way to accomplish this? I was considering the following options:
- Making the row key a concatenation of app id and a bucket number 0-99, computed as event id mod 100. When reading I would just fetch all the bucketed keys at once (app_id in ('1234_0', '1234_1', '1234_2', ...)). This would alleviate the "hot" key issue but still seems expensive and a little hacky.
- I tried removing app_id from the primary key altogether (using a primary key of (user_id, event_time, event_id)) and making app_id a secondary index; I would then sort by time on the client. The query above is valid, but running it is VERY slow, as I believe it needs to fetch every row key that matches the index, which is quite expensive (I get a timeout in cqlsh).
- Creating a different column family for each app id (e.g. 1234_organic_events). Note that we could easily have 1000s of application ids.
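The first option above (bucketing the partition key by a mod of event id) can be sketched as follows; the bucket count, the key format, and the client-side merge step are all assumptions for illustration, not a settled design:

```python
import heapq
from datetime import datetime

N_BUCKETS = 100  # assumed bucket count; tune to spread the write load

def bucket_key(app_id, event_id, n_buckets=N_BUCKETS):
    """Partition key for one bucket, e.g. '1234_7' (assumes numeric event_id)."""
    return f"{app_id}_{event_id % n_buckets}"

def merge_desc(bucket_results):
    """Merge per-bucket result lists (each already sorted by event_time DESC,
    as Cassandra returns them) into one DESC-ordered stream."""
    return list(heapq.merge(*bucket_results, key=lambda r: r[0], reverse=True))

# Each sub-list mimics the rows returned for one bucket's partition.
buckets = [
    [(datetime(2012, 1, 9), "e9"), (datetime(2012, 1, 3), "e3")],
    [(datetime(2012, 1, 7), "e7"), (datetime(2012, 1, 1), "e1")],
]
merged = merge_desc(buckets)  # e9, e7, e3, e1 — newest first
```

Because each bucket's rows arrive pre-sorted, the merge is a cheap k-way heap merge rather than a full client-side sort, which keeps the read fan-out cost manageable.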