incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: schema example
Date Wed, 20 May 2009 03:00:28 GMT
Mail storage, man, I think pretty much anything I could come up with
would look pretty simplistic compared to what "real" systems do in
that domain. :)

But blogs, I think I can handle those.  Let's make ours multiuser,
or there isn't enough scale to make it interesting. :)

The interesting thing here is we want to be able to query two things
efficiently:
 - the most recent posts belonging to a given blog, in reverse
chronological order
 - a single post and its comments, in chronological order

At first glance you might think we can again reasonably do this with a
single CF, this time a super CF:

<ColumnFamily ColumnType="Super" ColumnSort="Time" Name="Post"/>

The key is the blog name, the supercolumns are posts and the
subcolumns are comments.  This would be reasonable BUT supercolumns
are just containers, they have no data or timestamp associated with
them directly (only through their subcolumns).  So you cannot sort a
super CF by time.

So instead what I would do would be to use two CFs:

<ColumnFamily ColumnSort="Time" Name="Post"/>
<ColumnFamily ColumnSort="Time" Name="Comment"/>

For the first, the keys used would be blog names, and the columns
would be the post titles and body.  So to get a list of most recent
posts you just do a slice query.  Even though Cassandra currently
handles large groups of columns sub-optimally, even with a blog
updated several times a day you'd be safe taking this approach (i.e.
we'll have that problem fixed before you start seeing it :).
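
To make the slice idea concrete, here is a rough in-memory model of the
Post CF in Python.  It is only a sketch of the data layout, not the
Cassandra client API: the TimeSortedCF class, its method signatures and
the "evans-blog" key are all made up for illustration.

```python
# In-memory model of a ColumnSort="Time" column family.
# Hypothetical names throughout; this is not the Cassandra API.

class TimeSortedCF:
    def __init__(self):
        # row key -> {column name: (value, timestamp)}
        self.rows = {}

    def insert(self, key, column, value, timestamp):
        self.rows.setdefault(key, {})[column] = (value, timestamp)

    def get_slice(self, key, count):
        # Newest first, mirroring the reverse chronological time sort.
        columns = sorted(
            self.rows.get(key, {}).items(),
            key=lambda item: item[1][1],
            reverse=True,
        )
        return [(name, value) for name, (value, _ts) in columns[:count]]


post = TimeSortedCF()
# Key is the blog name, column name is the post title, value is the body.
post.insert("evans-blog", "First post", "Hello", timestamp=1)
post.insert("evans-blog", "Second post", "More", timestamp=2)
post.insert("evans-blog", "Third post", "Again", timestamp=3)

recent = post.get_slice("evans-blog", count=2)
```

Here `recent` holds the two newest posts, newest first, which is the
"most recent posts for a blog" query from the list above.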

For the second, the keys are <blog name><delimiter><post title>.  The
columns are the comment data.  You can serialize these a number of
ways; I would probably use title as the column name and have the value
be the author + body (e.g. as a json dict).  Again we use the slice
call to get the comments in order.  (We will have to manually reverse
what slice gives us since time sort is always reverse chronological
atm, but the overhead of doing this in memory will be negligible.)
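
A small Python sketch of the Comment layout, under some assumptions: the
"|" delimiter, the helper names and the sample data are hypothetical, and
the newest-first list stands in for whatever the slice call returns.

```python
import json

DELIMITER = "|"  # hypothetical; pick something that cannot occur in a blog name


def comment_key(blog, post_title):
    # Row key for the Comment CF: blog name, delimiter, post title.
    return blog + DELIMITER + post_title


def comment_value(author, body):
    # Column value: author and body serialized as a JSON dict.
    return json.dumps({"author": author, "body": body})


def chronological(slice_result):
    # slice_result is a list of (column name, value) pairs, newest first.
    # Reverse it in memory and decode the JSON values.
    return [(name, json.loads(value)) for name, value in reversed(slice_result)]


key = comment_key("evans-blog", "First post")

# Stand-in for a slice on `key`: newest comment first.
newest_first = [
    ("Re: Re: First post", comment_value("bob", "Agreed")),
    ("Re: First post", comment_value("alice", "Nice")),
]

in_order = chronological(newest_first)
```

After the reversal `in_order` starts with the oldest comment, so the page
can render the post followed by its comments in the order they were made.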

Does this help?

-Jonathan

On Tue, May 19, 2009 at 11:49 AM, Evan Weaver <evan@cloudbur.st> wrote:
> Even if it's not actually in real-life use, some examples for common
> domains would really help clarify things.
>
>  * blog
>  * email storage
>  * search index
>
> etc.
>
> Evan
>
> On Mon, May 18, 2009 at 8:19 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> Does anyone have a simple app schema they can share?
>>
>> I can't share the one for our main app.  But we do need an example
>> here.  A real one would be nice if we can find one.
>>
>> I checked App Engine.  They don't have a whole lot of examples either.
>>  They do have a really simple one:
>> http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
>>
>> The most important thing in Cassandra modeling is choosing a good key,
>> since that is what most of your lookups will be by.  Keys are also how
>> Cassandra scales -- Cassandra can handle effectively infinite keys
>> (given enough nodes obviously) but only thousands to millions of
>> columns per key/CF (depending on what API calls you use -- Jun is
>> adding one now that does not deserialize everything in the whole CF
>> into memory.  The rest will need to follow this model eventually too).
>>
>> For this guestbook I think the choice is obvious: use the name as the
>> key, and have a single simple CF for the messages.  Each column will
>> be a message (you can even use the mandatory timestamp field as part
>> of your user-visible data.  win!).  You get the list (or page) of
>> users with get_key_range and then their messages with get_slice.
>>
>> <ColumnFamily ColumnSort="Name" Name="Message"/>
>>
>> Anyone got another one for pedagogical purposes?
>>
>> -Jonathan
>>
>
>
>
> --
> Evan Weaver
>
