From: Keith Wright <kwright@nanigans.com>
To: user@cassandra.apache.org
Date: Tue, 7 May 2013 15:02:06 -0500
Subject: CQL3 Data Model Question

Hi all,

    I was hoping you could provide some assistance with a data modeling question (my apologies if a similar question has already been posed). I have time-based data that I need to store on a per-customer (aka app id) basis so that I can easily return it in sorted order by event time. The data in question is being written at high volume (~50K writes/sec), and I am concerned about the cardinality of using either app id or event time as the row key, as either will likely result in hot spots. Here is the table definition I am considering:

create table organic_events (
    event_id UUID,
    app_id INT,
    event_time TIMESTAMP,
    user_id INT,
    ...
    PRIMARY KEY (app_id, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC);

So that I can query as follows, which will naturally sort the results by time descending:

select * from organic_events where app_id = 1234 and event_time >= '2012-01-01' and event_time < '2012-02-01';

Anyone have an idea of the best way to accomplish this? I was considering the following:

* Making the row key a concatenation of app id and 0-100, using a mod on event id to get the value. When getting data I would just fetch all keys given the mods (app_id in (1234_0, 1234_1, 1234_2, etc.)). This would alleviate the "hot" key issue but still seems expensive and a little hacky.
* I tried removing app_id from the primary key altogether (using a primary key of user_id, event_time, event_id) and making app_id a secondary index. I would need to sort by time on the client. The above query is valid; however, running it is VERY slow, as I believe it needs to fetch every row key that matches the index, which is quite expensive (I get a timeout in cqlsh).
* Create a different column family for each app id (i.e. 1234_organic_events). Note that we could easily have 1000s of application ids.

Thanks!
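To make the first bullet concrete, here is a minimal Python sketch of the shard-bucketing idea: spread one hot app id across N partitions by taking the event id modulo a fixed shard count, then fan reads out across all shards and merge them by event time on the client. The function names and the shard count of 100 are illustrative assumptions, not an existing API.

```python
# Hypothetical sketch of the "app_id + mod bucket" sharding scheme described
# above. Shard count and helper names are assumptions for illustration.
import heapq
import uuid

NUM_SHARDS = 100  # assumption: the 0-100 bucket range mentioned in the email

def shard_key(app_id: int, event_id: uuid.UUID) -> str:
    """Build a partition key like '1234_37' from app id plus
    (event id mod NUM_SHARDS)."""
    return f"{app_id}_{event_id.int % NUM_SHARDS}"

def merge_by_time_desc(per_shard_rows):
    """Merge per-shard result lists, each already sorted by event_time
    descending (as the clustering order guarantees), into one
    time-descending stream without re-sorting everything."""
    return list(heapq.merge(*per_shard_rows,
                            key=lambda row: row[0], reverse=True))

# Toy rows: (event_time, event_id) tuples, each shard pre-sorted descending,
# standing in for the result of one per-shard SELECT.
shard_a = [(50, "e3"), (10, "e1")]
shard_b = [(40, "e4"), (20, "e2")]
print(merge_by_time_desc([shard_a, shard_b]))
# [(50, 'e3'), (40, 'e4'), (20, 'e2'), (10, 'e1')]
```

The trade-off matches the email's concern: writes spread evenly across NUM_SHARDS partitions, but every read becomes a fan-out over all shards plus a client-side merge.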