incubator-cassandra-user mailing list archives

From Evan Weaver <ewea...@gmail.com>
Subject Re: schema example
Date Sat, 04 Jul 2009 02:05:01 GMT
FYI, Yahoo does an interesting thing in this case. They usually use
token pagination, but if a page displays limit 20 records, they
actually request limit 100 behind the scenes. The extra records are
used to generate deep links. So instead of just being able to go to
the next page:

prev | cur | next

You can render:

prev | cur | next | cur + 2 | cur + 3 | cur + 4 | cur + 5

This lets you smoothly trade off navigability for performance.
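
A minimal sketch of that lookahead idea, assuming a sorted in-memory list stands in for the data store. The function name `fetch_page` and its parameters are hypothetical and are not part of any Cassandra or Yahoo API; the numbers 20 and 100 come from the description above.

```python
from bisect import bisect_right


def fetch_page(sorted_keys, token=None, page_size=20, lookahead_pages=5):
    """Token pagination with lookahead over a sorted list of keys.

    `token` is the last key of the previous page (None means start at the head).
    Returns (records for the current page, tokens usable as deep links).
    """
    # Resume strictly after the token; with no token, start at the head.
    start = 0 if token is None else bisect_right(sorted_keys, token)

    # Over-fetch: request page_size * lookahead_pages records (e.g. 100),
    # although only page_size (e.g. 20) are displayed.
    fetched = sorted_keys[start:start + page_size * lookahead_pages]
    current = fetched[:page_size]

    # The token for each later page is the last key of the page before it.
    # The first token is "next", the second is "cur + 2", and so on.
    # Caveat: if the data ends exactly on a page boundary, the final
    # token points at an empty page.
    deep_links = [
        fetched[i - 1]
        for i in range(page_size, len(fetched) + 1, page_size)
    ]
    return current, deep_links
```

With 250 keys numbered 1 to 250, `fetch_page(keys)` returns keys 1 to 20 plus the tokens `[20, 40, 60, 80, 100]`, which is enough to render next through cur + 5 from a single request. A larger `lookahead_pages` buys more deep links at the cost of a bigger read.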

Evan

On Fri, Jul 3, 2009 at 6:53 PM, Evan Weaver<eweaver@gmail.com> wrote:
> (From talking on IRC):
>
> I think this boils down to the offset/limit vs. token/limit debate.
>
> Token/limit is fine in all cases for me, but you still have to be able
> to query the head of the list (with a limit, but no token) to get
> started. Right now there is no facility for that on time-sorted column
> families:
>
>  list<column_t> get_columns_since(1:string tablename, 2:string key,
> 3:string columnParent, 4:i64 timeStamp)
>
> I don't think token ranges are supported on time columns, either.
>
> Also, to be optimally useable, you need to be able to begin a
> token-based pagination system from either the head or tail of the
> list, but that may not be possible with sstables.
>
> It may just be an oversight...the API is confusingly organized, and
> it's hard to be sure if some likely feature is there or not.
>
> Related:
>
> http://issues.apache.org/jira/browse/CASSANDRA-261
> http://issues.apache.org/jira/browse/CASSANDRA-217
> http://issues.apache.org/jira/browse/CASSANDRA-263
>
>
> Evan
>
> On Fri, Jul 3, 2009 at 6:06 PM, Evan Weaver<eweaver@gmail.com> wrote:
>> That requires you to know the timestamp, so you can't just ask for the
>> most recent one.
>>
>> Evan
>>
>> On Fri, Jul 3, 2009 at 6:02 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>>> get_columns_since
>>>
>>> On Fri, Jul 3, 2009 at 7:21 PM, Evan Weaver<eweaver@gmail.com> wrote:
>>>> This helps a lot.
>>>>
>>>> However, I can't find any API method that actually lets me do a
>>>> slice query on a time-sorted column, as necessary for the second blog
>>>> example. I get the following error on r789419:
>>>>
>>>> InvalidRequestException: get_slice_from requires CF indexed by name
>>>>
>>>> Evan
>>>>
>>>> On Tue, May 19, 2009 at 8:00 PM, Jonathan Ellis<jbellis@gmail.com> wrote:
>>>>> Mail storage, man, I think pretty much anything I could come up with
>>>>> would look pretty simplistic compared to what "real" systems do in
>>>>> that domain. :)
>>>>>
>>>>> But blogs, I think I can handle those.  Let's make ours multiuser
>>>>> or there isn't enough scale to make it interesting. :)
>>>>>
>>>>> The interesting thing here is we want to be able to query two things
>>>>> efficiently:
>>>>>  - the most recent posts belonging to a given blog, in reverse
>>>>> chronological order
>>>>>  - a single post and its comments, in chronological order
>>>>>
>>>>> At first glance you might think we can again reasonably do this with
>>>>> a single CF, this time a super CF:
>>>>>
>>>>> <ColumnFamily ColumnType="Super" ColumnSort="Time" Name="Post"/>
>>>>>
>>>>> The key is the blog name, the supercolumns are posts and the
>>>>> subcolumns are comments.  This would be reasonable BUT supercolumns
>>>>> are just containers, they have no data or timestamp associated with
>>>>> them directly (only through their subcolumns).  So you cannot sort a
>>>>> super CF by time.
>>>>>
>>>>> So instead what I would do would be to use two CFs:
>>>>>
>>>>> <ColumnFamily ColumnSort="Time" Name="Post"/>
>>>>> <ColumnFamily ColumnSort="Time" Name="Comment"/>
>>>>>
>>>>> For the first, the keys used would be blog names, and the columns
>>>>> would be the post titles and body.  So to get a list of most recent
>>>>> posts you just do a slice query.  Even though Cassandra currently
>>>>> handles large groups of columns sub-optimally, even with a blog
>>>>> updated several times a day you'd be safe taking this approach (i.e.
>>>>> we'll have that problem fixed before you start seeing it :).
>>>>>
>>>>> For the second, the keys are blog name<delimiter><post title>.  The
>>>>> columns are the comment data.  You can serialize these a number of
>>>>> ways; I would probably use title as the column name and have the value
>>>>> be the author + body (e.g. as a json dict).  Again we use the slice
>>>>> call to get the comments in order.  (We will have to manually reverse
>>>>> what slice gives us since time sort is always reverse chronological
>>>>> atm, but the overhead of doing this in memory will be negligible.)
>>>>>
>>>>> Does this help?
>>>>>
>>>>> -Jonathan
>>>>>
>>>>> On Tue, May 19, 2009 at 11:49 AM, Evan Weaver <evan@cloudbur.st> wrote:
>>>>>> Even if it's not actually in real-life use, some examples for common
>>>>>> domains would really help clarify things.
>>>>>>
>>>>>>  * blog
>>>>>>  * email storage
>>>>>>  * search index
>>>>>>
>>>>>> etc.
>>>>>>
>>>>>> Evan
>>>>>>
>>>>>> On Mon, May 18, 2009 at 8:19 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>>>>>> Does anyone have a simple app schema they can share?
>>>>>>>
>>>>>>> I can't share the one for our main app.  But we do need an example
>>>>>>> here.  A real one would be nice if we can find one.
>>>>>>>
>>>>>>> I checked App Engine.  They don't have a whole lot of examples
>>>>>>> either.  They do have a really simple one:
>>>>>>> http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
>>>>>>>
>>>>>>> The most important thing in Cassandra modeling is choosing a good key,
>>>>>>> since that is what most of your lookups will be by.  Keys are also how
>>>>>>> Cassandra scales -- Cassandra can handle effectively infinite keys
>>>>>>> (given enough nodes obviously) but only thousands to millions of
>>>>>>> columns per key/CF (depending on what API calls you use -- Jun is
>>>>>>> adding one now that does not deserialize everything in the whole CF
>>>>>>> into memory.  The rest will need to follow this model eventually too).
>>>>>>>
>>>>>>> For this guestbook I think the choice is obvious: use the name as the
>>>>>>> key, and have a single simple CF for the messages.  Each column will
>>>>>>> be a message (you can even use the mandatory timestamp field as part
>>>>>>> of your user-visible data.  win!).  You get the list (or page) of
>>>>>>> users with get_key_range and then their messages with get_slice.
>>>>>>>
>>>>>>> <ColumnFamily ColumnSort="Name" Name="Message"/>
>>>>>>>
>>>>>>> Anyone got another one for pedagogical purposes?
>>>>>>>
>>>>>>> -Jonathan
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Evan Weaver
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Evan Weaver
>>>>
>>>
>>
>>
>>
>> --
>> Evan Weaver
>>
>
>
>
> --
> Evan Weaver
>



-- 
Evan Weaver
