cassandra-user mailing list archives

From Jonathan Ellis <>
Subject Re: Greetings!
Date Tue, 28 Jul 2009 19:52:40 GMT
On Tue, Jul 28, 2009 at 4:26 AM, Colin Mollenhour<> wrote:
> I need to be able to fetch all or latest events with the following
> "queries":
> -A specific journal
> -All of a user's journals
> -A specific event type
> -A specific event type for a specific journal
> -A specific event type for all of a user's journals
> After much deliberation in trying to figure out how to do the above
> without having to loop through many many queries here is the schema I
> arrived at:
> If I am correct in my thinking, all of the above cases can be retrieved
> in one or two steps with the maximum number of queries being determined
> by the number of journals in question.

I think you have the right idea.  And thanks for taking the trouble to
draw a diagram; that was very useful. :)

One caveat is that the subcolumns of supercolumns are not indexed.
When you query those, Cassandra reads the entire Supercolumn into
memory.  So they are best suited for small bunches of attributes, not
up to 60k events.

If the event names cannot clash with user names then you might just
put all of the data / event / permissions data in the same row without
extra namespacing.  Otherwise, you will have to put each of those
types of data in its own row.  Which is better depends on your query
needs.  (My initial impression is that the 2nd is a better fit for you.)

There's a related problem with your type index: Cassandra still
materializes entire rows in memory at compaction time (see
CASSANDRA-16).  So for now you might want to split those across rows
as $type|$journalid, in a simple columnfamily with each row only about
that one journal.  Then you can do range queries to get the journals
needed, then slice for the events as needed.
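
A rough sketch of the $type|$journalid layout described above, using a
plain Python dict as a stand-in for a simple columnfamily.  The names
(row_key, rows_for_type, latest_events) and the sample data are
hypothetical, and the key range scan assumes row keys are kept in sorted
order, which in Cassandra depends on the partitioner in use.

```python
from bisect import bisect_left

def row_key(event_type, journal_id):
    # One row per (event type, journal) pair, so no single row grows huge.
    return f"{event_type}|{journal_id}"

# Hypothetical in-memory stand-in for a simple column family:
# row key -> {column name (event timestamp): event payload}
rows = {
    row_key("comment", "j1"): {1001: "c-a", 1005: "c-b"},
    row_key("comment", "j2"): {1003: "c-c"},
    row_key("edit", "j1"): {1002: "e-a"},
}

def rows_for_type(rows, event_type):
    # Range query over sorted row keys: everything starting with "type|".
    # "}" is the ASCII character right after "|", so it bounds the prefix.
    keys = sorted(rows)
    lo = bisect_left(keys, event_type + "|")
    hi = bisect_left(keys, event_type + "}")
    return keys[lo:hi]

def latest_events(rows, key, count):
    # Slice within one row: take the newest `count` columns by timestamp.
    columns = rows[key]
    newest = sorted(columns, reverse=True)[:count]
    return [columns[name] for name in newest]

comment_rows = rows_for_type(rows, "comment")
latest = latest_events(rows, row_key("comment", "j1"), 1)
```

Here the range query finds every journal that has "comment" events, and
the slice then pulls only the most recent events from one of those rows.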

One other suggestion would be that it generally simplifies things to
use natural keys, rather than surrogate (_id keys).  And if you do use
surrogate keys, use UUIDs rather than numeric counters.

> Am I wrong to try to reduce the number of indexes and round-trips to the
> database by modeling this way?

No.  If anything, you may not be denormalizing enough.  Having CFs
like the event details off by themselves when they don't directly need
to be queried looks fishy.

> Some more general questions:
> My model assumes the use of get_slice_by_names with a potentially large
> number of keys, is that ok?

For the numbers you are talking about (< 100,000) it should be.  Just
be aware that serialization of the request won't be negligible at
those numbers.  Using get_slice with start and finish ranges will be
more efficient in that respect.
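
As a back-of-the-envelope illustration of the serialization point, the
sketch below compares the raw bytes of column names sent for a by-names
request against a start/finish range.  The 20-byte column names are a
made-up assumption, and the numbers ignore all protocol framing, so
treat them as an order-of-magnitude hint only.

```python
# Hypothetical column names: 100,000 zero-padded 20-character strings.
names = [f"{i:020d}" for i in range(100_000)]

# By-names request: every column name has to travel in the request.
by_names_bytes = sum(len(name.encode()) for name in names)

# Range request: only the start and finish names travel in the request.
range_bytes = len(names[0].encode()) + len(names[-1].encode())

print(by_names_bytes, range_bytes)
```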

> Cassandra lacks transactions and increment methods, is there a way to
> generate unique user ids with just Cassandra as the authority that I am
> missing?

Yeah, UUIDs as above.
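
For instance, a minimal sketch using Python's standard uuid module (the
new_user_id helper is a hypothetical name).  Each client can generate
ids on its own, with no counter or coordination through the database.

```python
import uuid

def new_user_id():
    # Random (version 4) UUID; generated client-side, no shared counter.
    return str(uuid.uuid4())

# Generate a batch of ids independently.
ids = {new_user_id() for _ in range(1000)}
```

uuid.uuid1() is an alternative if time-based ids are preferred.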

> Is it silly to use short column names for the sake of performance or
> storage efficiency? E.g. uid instead of user_id. I like verbose names...

IMO, that is unlikely to make the difference between a workable
solution and an unworkable one.

