Hi folks – I’m doing an informal proof-of-concept with Cassandra and I’ve been getting conflicting advice about how I should lay out my data. Perhaps somebody could point me in the right direction.

I have a column family that will hold billions of rows. The data have no intrinsic unique identifier. A given row will have, say, 50 columns, and I’ll need to query efficiently on 8-10 of them.

I’ve been told that I should just pick the most commonly searched column and make that my primary key, even though it will not be unique. That seems contrary to the documentation I am seeing online.

From my reading, it seems like I need a UUID column as my primary key, and then I should set up secondary indexes on the 8-10 main search columns. Am I understanding this correctly? Any advice you can offer would be tremendously helpful. I’m quite limited in how specific I can be about the data, of course.
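
To make the question concrete, here is a toy Python sketch of the two options as I understand them. It is an in-memory model only, not Cassandra itself, and the column names (region, status, amount) are made-up placeholders rather than my real schema. It assumes that a write to an existing primary key overwrites the earlier row, which is how I read the documentation.

```python
import uuid
from collections import defaultdict

# Hypothetical records; these column names are placeholders, not the real schema.
records = [
    {"region": "east", "status": "open", "amount": 10},
    {"region": "east", "status": "closed", "amount": 20},
    {"region": "west", "status": "open", "amount": 30},
]

# Option A: the most-searched column alone is the primary key.
# Assumption: a second write to the same key replaces the first row.
by_region = {}
for rec in records:
    by_region[rec["region"]] = rec

# Option B: a surrogate UUID primary key, plus one index per searched column.
by_id = {}
indexes = {"region": defaultdict(set), "status": defaultdict(set)}
for rec in records:
    row_id = uuid.uuid4()
    by_id[row_id] = rec
    for column, index in indexes.items():
        index[rec[column]].add(row_id)


def lookup(column, value):
    """Return every stored row whose indexed column equals value."""
    return [by_id[row_id] for row_id in indexes[column].get(value, ())]


print(len(by_region))                  # 2: one of the "east" rows was overwritten
print(len(by_id))                      # 3: every row is kept
print(len(lookup("region", "east")))   # 2: both "east" rows are found via the index
```

If that model is right, option A silently drops rows that share a key, while option B keeps them all. The sketch says nothing about how either option would perform at billions of rows, which is the part I am least sure about.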

Steve