From: Keith Wright <kwright@nanigans.com>
To: user@cassandra.apache.org
Date: Tue, 7 May 2013 15:02:06 -0500
Subject: CQL3 Data Model Question

Hi all,

    I was hoping you could provide some assistance with a data modeling question (my apologies if a similar question has already been posed). I have time-based data that I need to store on a per-customer (aka app id) basis so that I can easily return it in sorted order by event time. The data in question is being written at high volume (~50K writes/sec), and I am concerned about the cardinality of using either app id or event time as the row key, as either will likely result in hot spots. Here is the table definition I am considering:

create table organic_events (
    event_id UUID,
    app_id INT,
    event_time TIMESTAMP,
    user_id INT,
    ...
    PRIMARY KEY (app_id, event_time, event_id)
) WITH CLUSTERING ORDER BY (event_time DESC);

So that I can query as follows, which will naturally sort the results by time descending:

select * from organic_events where app_id = 1234 and event_time >= '2012-01-01' and event_time < '2012-02-01';

Anyone have an idea of the best way to accomplish this? I was considering the following:

* Making the row key a concatenation of app id and 0-100, using a mod on event id to get the value. When getting data I would just fetch all keys given the mods (app_id in (1234_0, 1234_1, 1234_2, etc.)). This would alleviate the "hot" key issue but still seems expensive and a little hacky.
* I tried removing app_id from the primary key altogether (using a primary key of user_id, event_time, event_id) and making app_id a secondary index. I would need to sort by time on the client. The above query is valid; however, running it is VERY slow, as I believe it needs to fetch every row key that matches the index, which is quite expensive (I get a timeout in cqlsh).
* Create a different column family for each app id (i.e. 1234_organic_events). Note that we could easily have 1000s of application ids.

Thanks!
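To make the first bullet concrete, here is a minimal Python sketch of the shard-bucketing idea: spread one hot app id across N partitions by taking the event id modulo a fixed shard count, then fan reads out across all shards and merge them by event time on the client. The function names and the shard count of 100 are illustrative assumptions, not an existing API.

```python
# Hypothetical sketch of the "app_id + mod bucket" sharding scheme described
# above. Shard count and helper names are assumptions for illustration.
import heapq
import uuid

NUM_SHARDS = 100  # assumption: the 0-100 bucket range mentioned in the email

def shard_key(app_id: int, event_id: uuid.UUID) -> str:
    """Build a partition key like '1234_37' from app id plus
    (event id mod NUM_SHARDS)."""
    return f"{app_id}_{event_id.int % NUM_SHARDS}"

def merge_by_time_desc(per_shard_rows):
    """Merge per-shard result lists, each already sorted by event_time
    descending (as the clustering order guarantees), into one
    time-descending stream without re-sorting everything."""
    return list(heapq.merge(*per_shard_rows,
                            key=lambda row: row[0], reverse=True))

# Toy rows: (event_time, event_id) tuples, each shard pre-sorted descending,
# standing in for the result of one per-shard SELECT.
shard_a = [(50, "e3"), (10, "e1")]
shard_b = [(40, "e4"), (20, "e2")]
print(merge_by_time_desc([shard_a, shard_b]))
# [(50, 'e3'), (40, 'e4'), (20, 'e2'), (10, 'e1')]
```

The trade-off matches the email's concern: writes spread evenly across NUM_SHARDS partitions, but every read becomes a fan-out over all shards plus a client-side merge.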