incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David McNelis <dmcne...@gmail.com>
Subject Re: Date range queries
Date Wed, 19 Jun 2013 19:28:57 GMT
So, if you want to grab by the created_at and occasionally limit by
question id, that is why you'd use created_at.

The way the primary keys work is the first part of the primary key is the
Partioner key, that field is what essentially is the single cassandra row.
 The second key is the order preserving key, so you can sort by that key.
 If you have a third piece, then that is the secondary order preserving key.

The reason you'd want to do (user_id, created_at, question_id) is because
when you do a query on the keys, if you MUST use the preceding pieces of
the primary key.  So in your case, you could not do a query with just
user_id and question_id with the user-created-question key.  Alternatively
if you went with (user_id, question_id, created_at), you would not be able
to include a range of created_at unless you were also filtering on the
question_id.

Does that make sense?

As for the large rows, 10k is unlikely to cause you too many issues (unless
the answer is potentially a big blob of text).  Newer versions of cassandra
deal with a lot of things in far, far, superior ways to < 1.0.

For a really good primary on keys in cql and how to potentially avoid hot
rows, a really good article to read is this one:
http://thelastpickle.com/2013/01/11/primary-keys-in-cql/  Aaron did a great
job of laying out the subtleties of primary keys in CQL.


On Wed, Jun 19, 2013 at 2:21 PM, Christopher J. Bottaro <
cjbottaro@academicworks.com> wrote:

> Interesting, thank you for the reply.
>
> Two questions though...
>
> Why should created_at come before question_id in the primary key?  In
> other words, why (user_id, created_at, question_id) instead of (user_id,
> question_id, created_at)?
>
> Given this setup, all a user's answers (all 10k) will be stored in a
> single C* (internal, not cql) row?  I thought having "fat" or "big" rows
> was bad.  I worked with Cassandra 0.6 at my previous job and given the
> nature of our work, we would sometimes generate these "fat" rows... at
> which point Cassandra would basically shit the bed.
>
> Thanks for the help.
>
>
> On Wed, Jun 19, 2013 at 12:26 PM, David McNelis <dmcnelis@gmail.com>wrote:
>
>> I think you'd just be better served with just a little different primary
>> key.
>>
>> If your primary key was (user_id, created_at)  or (user_id, created_at,
>> question_id), then you'd be able to run the above query without a problem.
>>
>> This will mean that the entire pantheon of a specific user_id will be
>> stored as a 'row' (in the old style C* vernacular), and then the
>> information would be ordered by the 2nd piece of the primary key (or 2nd,
>> then 3rd if you included question_id).
>>
>> You would certainly want to include any field that makes a record unique
>> in the primary key.  Another thing to note is that if a field is part of
>> the primary key you can not create a secondary index on that field.  You
>> can work around that by storing the field twice, but you might want to
>> rethink your structure if you find yourself doing that often.
>>
>>
>> On Wed, Jun 19, 2013 at 12:05 PM, Christopher J. Bottaro <
>> cjbottaro@academicworks.com> wrote:
>>
>>> Hello,
>>>
>>> We are considering using Cassandra and I want to make sure our use case
>>> fits Cassandra's strengths.  We have the table like:
>>>
>>> answers
>>> -------
>>> user_id | question_id | result | created_at
>>>
>>> Where our most common query will be something like:
>>>
>>> SELECT * FROM answers WHERE user_id = 123 AND created_at > '01/01/2012'
>>> AND created_at < '01/01/2013'
>>>
>>> Sometimes we will also limit by a question_id or a list of question_ids.
>>>
>>> Secondary indexes will be created on user_id and question_id.  We expect
>>> the upper bound of number of answers for a given user to be around 10,000.
>>>
>>> Now my understanding of how Cassandra will run the aforementioned query
>>> is that it will load all the answers for a given user into memory using the
>>> secondary index, then scan over that set filtering based on the dates.
>>>
>>> Considering that that will be our most used query and it will happen
>>> very often, is this a bad use case for Cassandra?
>>>
>>> Thanks for the help.
>>>
>>
>>
>

Mime
View raw message