From: "Hiller, Dean" <Dean.Hiller@nrel.gov>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tue, 11 Dec 2012 16:26:13 -0700
Subject: Re: Primary/secondary index question / best practices?

Is there any column that would qualify as a good partition key?

Some people partition by time, such as every month or every day, and then
you can either maintain your own secondary indexes that you query into (high
entropy is NOT a big deal here), or PlayOrm can do some of that for you, or
you could use CQL as well.
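
A minimal sketch of the time-partitioned, hand-maintained index idea, using
plain Python dicts as a stand-in for column families. The names
(month_bucket, PartitionedIndex) and the "YYYY-MM" bucket format are
assumptions made up for illustration; this is not the PlayOrm or CQL API.

```python
from collections import defaultdict
from datetime import datetime


def month_bucket(ts: datetime) -> str:
    """Partition id for a timestamp: one partition per month (assumed format)."""
    return ts.strftime("%Y-%m")


class PartitionedIndex:
    """In-memory stand-in for index rows keyed by (partition, column, value)."""

    def __init__(self):
        self.rows = {}                   # row key -> column values
        self.index = defaultdict(set)    # (partition, column, value) -> row keys

    def insert(self, row_key, ts, columns):
        """Store the row and add one index entry per indexed column."""
        part = month_bucket(ts)
        self.rows[row_key] = dict(columns)
        for col, val in columns.items():
            self.index[(part, col, val)].add(row_key)

    def lookup(self, ts, column, value):
        """Exact-match lookup inside the partition that contains ts."""
        part = month_bucket(ts)
        return sorted(self.index.get((part, column, value), set()))


idx = PartitionedIndex()
idx.insert("k1", datetime(2012, 12, 11), {"user": "steve", "ip": "10.0.0.1"})
idx.insert("k2", datetime(2012, 12, 12), {"user": "dean", "ip": "10.0.0.1"})
idx.insert("k3", datetime(2012, 11, 30), {"user": "steve", "ip": "10.0.0.2"})
```

Each lookup touches a single index entry in a single partition, which is why
a high-entropy column such as user name or IP address is not a problem in
this layout.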

Another partitioning scheme is to partition by client.

The goal is to keep each partition under roughly 5 million rows so your
wide-row index does not get too large.
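
As a rough sizing aid, the arithmetic can be sketched like this. The
function name, the candidate granularities, and the 31-day month are
assumptions for illustration; only the ~5 million row target comes from the
advice above.

```python
def pick_granularity(rows_per_day, limit=5_000_000):
    """Return the coarsest time bucket whose expected row count stays under limit.

    Returns None when even hourly buckets would exceed the limit, in which
    case another partition key (for example, client) is probably needed.
    """
    # (name, length of the bucket in days), coarsest first
    candidates = [("month", 31), ("day", 1), ("hour", 1 / 24)]
    for name, days in candidates:
        if rows_per_day * days < limit:
            return name
    return None
```

For example, 100,000 rows per day gives about 3.1 million rows per month,
so monthly partitions fit; at 1,000,000 rows per day a month would hold 31
million rows, so daily partitions are the coarsest choice that fits.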


Dean

From: "Stephen.M.Thompson@wellsfargo.com" <Stephen.M.Thompson@wellsfargo.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, December 11, 2012 3:45 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: Primary/secondary index question / best practices?


Dean, thank you for your response.  To the second half of the query, I'm a
little concerned about the secondary index approach, since the columns I
want to index have high entropy.


For example, I would like to query by User name and IP address, values
which are decidedly NOT like the pattern recommended in the Secondary Index
field.  The 8-10 columns I need to search by all have a similarly high
scatter rate.  Since the documentation seems to suggest that this is a bad
idea, what would the correct pattern look like?


In an RDBMS I would just slap an alternate key index on the table and let
it roll.  It seems like maybe that is not the right approach for Cassandra?


Thanks again,

Steve


-----Original Message-----
From: Hiller, Dean [mailto:Dean.Hiller@nrel.gov]
Sent: Tuesday, December 11, 2012 4:57 PM
To: user@cassandra.apache.org
Subject: Re: Primary/secondary index question / best practices?


It is hard to help out on a design without specifics, but here is some
advice based on the limited information.


Primary key: yes, it must be unique across the cluster.  TimeUUID or UUID...
PlayOrm has unique TimeUUID-like keys, as in this one: 7AL2S8Y.b1 (b1 is the
hostname and the prefix is a "unique" timestamp, but generated as a shorter
string; ah, nice readable primary keys).
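
To illustrate the shape of such a key, here is a hypothetical sketch that
base-36 encodes a millisecond timestamp and appends a host id. It is NOT
PlayOrm's actual algorithm; the function names and the encoding are
assumptions. Note that two keys generated on the same host in the same
millisecond would collide here, so a real generator needs a counter or
similar tie-breaker.

```python
import time

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(n: int) -> str:
    """Encode a non-negative integer as an upper-case base-36 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def short_key(host_id: str, millis=None) -> str:
    """Build '<base36 timestamp>.<host id>' (hypothetical key format)."""
    if millis is None:
        millis = int(time.time() * 1000)
    return f"{to_base36(millis)}.{host_id}"
```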


There are some patterns you can look into here that may help:
https://github.com/deanhiller/playorm/wiki/Patterns-Page


If you can partition your data virtually, it may help a lot so you can
query into the partitions.
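
One way to picture "querying into the partitions" when the partitions are
time-based: a range query first works out which partitions it has to visit.
The sketch below assumes monthly partitions named "YYYY-MM"; both the
function name and that naming scheme are made up for illustration.

```python
from datetime import date


def month_buckets(start: date, end: date):
    """List the 'YYYY-MM' partition ids covering start..end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets
```

A query for mid-November 2012 through early January 2013 would then run
once against each of the three partitions it returns, rather than against
the whole data set.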


Later,

Dean


From: "Stephen.M.Thompson@wellsfargo.com" <Stephen.M.Thompson@wellsfargo.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, December 11, 2012 2:49 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Primary/secondary index question / best practices?


From my reading, it seems like I need a UUID column that will be my primary
index, and then I should set up secondary indexes on the 8-10 primary search
columns.  Am I understanding this correctly?  Any advice you can offer on
this would be tremendously helpful.  I'm quite limited in how specific I can
be about the data, of course.