incubator-cassandra-user mailing list archives

From Evan Weaver <ewea...@gmail.com>
Subject Re: schema example
Date Sat, 04 Jul 2009 01:06:23 GMT
That requires you to know the timestamp, so you can't just ask for the
most recent one.

Evan
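
The limitation above can be illustrated with a minimal in-memory sketch in Python. This is not Cassandra's API: `columns_since` and `most_recent` are hypothetical stand-ins for a timestamp-bounded call and a count-bounded slice over time-sorted columns, and treating the timestamp bound as inclusive is an assumption.

```python
from typing import List, Tuple

# Hypothetical in-memory model: each column is (name, value, timestamp).
Column = Tuple[str, str, int]


def columns_since(columns: List[Column], timestamp: int) -> List[Column]:
    """Return columns written at or after `timestamp`, newest first.

    The caller has to supply a timestamp, which is the limitation
    pointed out in the message above.
    """
    newer = [c for c in columns if c[2] >= timestamp]
    return sorted(newer, key=lambda c: c[2], reverse=True)


def most_recent(columns: List[Column], count: int) -> List[Column]:
    """Return the `count` newest columns without needing any timestamp."""
    return sorted(columns, key=lambda c: c[2], reverse=True)[:count]


# Illustrative data only.
posts = [
    ("first post", "hello", 100),
    ("second post", "more", 200),
    ("third post", "latest", 300),
]

latest = most_recent(posts, 1)           # needs only a count
since_200 = columns_since(posts, 200)    # needs a known timestamp
```

With the first function, "give me the newest post" only works if the caller already knows roughly when it was written; the second needs only a count.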

On Fri, Jul 3, 2009 at 6:02 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
> get_columns_since
>
> On Fri, Jul 3, 2009 at 7:21 PM, Evan Weaver<eweaver@gmail.com> wrote:
>> This helps a lot.
>>
>> However, I can't find any API method that actually lets me do a
>> slice query on a time-sorted column, as necessary for the second blog
>> example. I get the following error on r789419:
>>
>> InvalidRequestException: get_slice_from requires CF indexed by name
>>
>> Evan
>>
>> On Tue, May 19, 2009 at 8:00 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>>> Mail storage, man, I think pretty much anything I could come up with
>>> would look pretty simplistic compared to what "real" systems do in
>>> that domain. :)
>>>
>>> But blogs, I think I can handle those.  Let's make ours multiuser
>>> or there isn't enough scale to make it interesting. :)
>>>
>>> The interesting thing here is we want to be able to query two things
>>> efficiently:
>>>  - the most recent posts belonging to a given blog, in reverse
>>> chronological order
>>>  - a single post and its comments, in chronological order
>>>
>>> At first glance you might think we can again reasonably do this with a
>>> single CF, this time a super CF:
>>>
>>> <ColumnFamily ColumnType="Super" ColumnSort="Time" Name="Post"/>
>>>
>>> The key is the blog name, the supercolumns are posts and the
>>> subcolumns are comments.  This would be reasonable BUT supercolumns
>>> are just containers, they have no data or timestamp associated with
>>> them directly (only through their subcolumns).  So you cannot sort a
>>> super CF by time.
>>>
>>> So instead what I would do would be to use two CFs:
>>>
>>> <ColumnFamily ColumnSort="Time" Name="Post"/>
>>> <ColumnFamily ColumnSort="Time" Name="Comment"/>
>>>
>>> For the first, the keys used would be blog names, and the columns
>>> would be the post titles and body.  So to get a list of most recent
>>> posts you just do a slice query.  Even though Cassandra currently
>>> handles large groups of columns sub-optimally, even with a blog
>>> updated several times a day you'd be safe taking this approach (i.e.
>>> we'll have that problem fixed before you start seeing it :).
>>>
>>> For the second, the keys are blog name<delimiter><post title>.  The
>>> columns are the comment data.  You can serialize these a number of
>>> ways; I would probably use title as the column name and have the value
>>> be the author + body (e.g. as a json dict).  Again we use the slice
>>> call to get the comments in order.  (We will have to manually reverse
>>> what slice gives us since time sort is always reverse chronological
>>> atm, but the overhead of doing this in memory will be negligible.)
>>>
>>> Does this help?
>>>
>>> -Jonathan
>>>
>>> On Tue, May 19, 2009 at 11:49 AM, Evan Weaver <evan@cloudbur.st> wrote:
>>>> Even if it's not actually in real-life use, some examples for common
>>>> domains would really help clarify things.
>>>>
>>>>  * blog
>>>>  * email storage
>>>>  * search index
>>>>
>>>> etc.
>>>>
>>>> Evan
>>>>
>>>> On Mon, May 18, 2009 at 8:19 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>>>> Does anyone have a simple app schema they can share?
>>>>>
>>>>> I can't share the one for our main app.  But we do need an example
>>>>> here.  A real one would be nice if we can find one.
>>>>>
>>>>> I checked App Engine.  They don't have a whole lot of examples either.
>>>>>  They do have a really simple one:
>>>>> http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
>>>>>
>>>>> The most important thing in Cassandra modeling is choosing a good key,
>>>>> since that is what most of your lookups will be by.  Keys are also how
>>>>> Cassandra scales -- Cassandra can handle effectively infinite keys
>>>>> (given enough nodes obviously) but only thousands to millions of
>>>>> columns per key/CF (depending on what API calls you use -- Jun is
>>>>> adding one now that does not deserialize everything in the whole CF
>>>>> into memory.  The rest will need to follow this model eventually too).
>>>>>
>>>>> For this guestbook I think the choice is obvious: use the name as the
>>>>> key, and have a single simple CF for the messages.  Each column will
>>>>> be a message (you can even use the mandatory timestamp field as part
>>>>> of your user-visible data.  win!).  You get the list (or page) of
>>>>> users with get_key_range and then their messages with get_slice.
>>>>>
>>>>> <ColumnFamily ColumnSort="Name" Name="Message"/>
>>>>>
>>>>> Anyone got another one for pedagogical purposes?
>>>>>
>>>>> -Jonathan
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Evan Weaver
>>>>
>>>
>>
>>
>>
>> --
>> Evan Weaver
>>
>



-- 
Evan Weaver
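
For readers following the quoted two-CF blog design, here is a minimal sketch of the data layout as a toy in-memory model in Python. It is not the Cassandra client API: the `TimeSortedCF` class, its `get_slice` signature, the `|` delimiter, and the sample blog data are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

Column = Tuple[str, str, int]  # (column name, value, timestamp)
DELIMITER = "|"  # hypothetical; the thread does not specify a delimiter


class TimeSortedCF:
    """Toy column family with ColumnSort="Time" semantics: newest first."""

    def __init__(self) -> None:
        self.rows: Dict[str, List[Column]] = defaultdict(list)

    def insert(self, key: str, name: str, value: str, timestamp: int) -> None:
        self.rows[key].append((name, value, timestamp))

    def get_slice(self, key: str, count: int) -> List[Column]:
        # Reverse chronological, matching the time sort described in the thread.
        ordered = sorted(self.rows.get(key, []), key=lambda c: c[2], reverse=True)
        return ordered[:count]


post_cf = TimeSortedCF()     # key: blog name, columns: posts
comment_cf = TimeSortedCF()  # key: blog name + delimiter + post title

post_cf.insert("myblog", "Hello", "first body", 1)
post_cf.insert("myblog", "Schemas", "second body", 2)

comment_key = "myblog" + DELIMITER + "Schemas"
comment_cf.insert(comment_key, "Nice", json.dumps({"author": "a", "body": "+1"}), 3)
comment_cf.insert(comment_key, "Question", json.dumps({"author": "b", "body": "how?"}), 4)

# Most recent posts: the slice is already in reverse chronological order.
recent_posts = [name for name, _, _ in post_cf.get_slice("myblog", 10)]

# Comments: reverse the slice in memory to get chronological order.
comments_in_order = [
    name for name, _, _ in reversed(comment_cf.get_slice(comment_key, 10))
]
```

The in-memory `reversed` call is the manual reversal mentioned in the quoted message; for a page of comments its cost is small.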
