Date: Tue, 19 May 2009 22:00:28 -0500
Subject: Re: schema example
From: Jonathan Ellis
To: cassandra-user@incubator.apache.org

Mail storage, man, I think pretty much anything I could come up with
would look pretty simplistic compared to what "real" systems do in that
domain. :)

But blogs, I think I can handle those. Let's make ours multiuser or
there isn't enough scale to make it interesting. :)

The interesting thing here is we want to be able to query two things
efficiently:
 - the most recent posts belonging to a given blog, in reverse
   chronological order
 - a single post and its comments, in chronological order

At first glance you might think we can again reasonably do this with a
single CF, this time a super CF: the key is the blog name, the
supercolumns are posts and the subcolumns are comments. This would be
reasonable BUT supercolumns are just containers; they have no data or
timestamp associated with them directly (only through their
subcolumns). So you cannot sort a super CF by time.

So instead I would use two CFs.

For the first, the keys would be blog names, and the columns would hold
the post titles and bodies. So to get a list of the most recent posts
you just do a slice query. Cassandra currently handles large groups of
columns sub-optimally, but even with a blog updated several times a day
you'd be safe taking this approach (i.e. we'll have that problem fixed
before you start seeing it :).
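To make the first CF concrete, here is a minimal sketch in plain Python. It is only a toy in-memory model of the layout described above (row key = blog name, one column per post, newest first), not the Cassandra client API. The names posts_cf, insert_post and recent_posts, the sample blog names, and the integer timestamps are all made up for illustration.

```python
from collections import defaultdict

# Toy stand-in for the posts column family:
# row key (blog name) -> {column name (post title): (body, timestamp)}.
# Hypothetical structure for illustration only, not Cassandra's API.
posts_cf = defaultdict(dict)


def insert_post(blog, title, body, timestamp):
    """Store one post as a column in the row keyed by the blog name."""
    posts_cf[blog][title] = (body, timestamp)


def recent_posts(blog, count):
    """Mimic a time-sorted slice: newest columns first, at most `count`."""
    columns = posts_cf.get(blog, {})
    newest_first = sorted(
        columns.items(), key=lambda item: item[1][1], reverse=True
    )
    return [(title, body) for title, (body, _ts) in newest_first[:count]]


insert_post("example-blog", "Hello world", "First post.", 1)
insert_post("example-blog", "Schemas", "Thinking about keys.", 2)
insert_post("example-blog", "Slices", "Reading recent columns.", 3)
insert_post("other-blog", "Unrelated", "Lives in a different row.", 4)

latest = recent_posts("example-blog", 2)
print(latest)
```

Each blog is its own row, so asking for the latest posts touches a single key, which is the property the two-CF design is after.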

For the second, the keys are blog names. The columns are the comment
data. You can serialize these a number of ways; I would probably use
the title as the column name and have the value be the author + body
(e.g. as a JSON dict). Again we use the slice call to get the comments
in order. (We will have to manually reverse what slice gives us, since
time sort is always reverse chronological atm, but the overhead of
doing this in memory will be negligible.)

Does this help?

-Jonathan

On Tue, May 19, 2009 at 11:49 AM, Evan Weaver wrote:
> Even if it's not actually in real-life use, some examples for common
> domains would really help clarify things.
>
>  * blog
>  * email storage
>  * search index
>
> etc.
>
> Evan
>
> On Mon, May 18, 2009 at 8:19 PM, Jonathan Ellis wrote:
>> Does anyone have a simple app schema they can share?
>>
>> I can't share the one for our main app. But we do need an example
>> here. A real one would be nice if we can find one.
>>
>> I checked App Engine. They don't have a whole lot of examples either.
>> They do have a really simple one:
>> http://code.google.com/appengine/docs/python/gettingstarted/usingdatastore.html
>>
>> The most important thing in Cassandra modeling is choosing a good key,
>> since that is what most of your lookups will be by. Keys are also how
>> Cassandra scales -- Cassandra can handle effectively infinite keys
>> (given enough nodes obviously) but only thousands to millions of
>> columns per key/CF (depending on what API calls you use -- Jun is
>> adding one now that does not deserialize everything in the whole CF
>> into memory. The rest will need to follow this model eventually too).
>>
>> For this guestbook I think the choice is obvious: use the name as the
>> key, and have a single simple CF for the messages. Each column will
>> be a message (you can even use the mandatory timestamp field as part
>> of your user-visible data. win!). You get the list (or page) of
>> users with get_key_range and then their messages with get_slice.
>>
>> Anyone got another one for pedagogical purposes?
>>
>> -Jonathan
>
> --
> Evan Weaver
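
Coming back to the second CF in the reply (comments stored as columns, with the author and body serialized as JSON), a rough sketch of the read path might look like the following. It is a toy in-memory model, not the Cassandra client API. The reply gives the row key as the blog name; this sketch assumes the key also carries the post title so that one row holds one post's comments, which is an assumption of the example and not something the reply states. All function names, sample data, and integer timestamps are hypothetical.

```python
import json

# Toy stand-in for the comments column family:
# row key -> {column name (comment title): (JSON value, timestamp)}.
# Hypothetical structure for illustration only, not Cassandra's API.
comments_cf = {}


def row_key(blog, post_title):
    # Assumption of this sketch: the key combines blog name and post title.
    return f"{blog}/{post_title}"


def add_comment(blog, post_title, comment_title, author, body, timestamp):
    """Store one comment as a column; the value is a JSON dict of author + body."""
    value = json.dumps({"author": author, "body": body})
    row = comments_cf.setdefault(row_key(blog, post_title), {})
    row[comment_title] = (value, timestamp)


def time_sorted_slice(key):
    """Mimic the time sort described in the reply: newest columns first."""
    columns = comments_cf.get(key, {})
    return sorted(columns.items(), key=lambda item: item[1][1], reverse=True)


def comments_in_order(blog, post_title):
    """Reverse the slice in memory so comments read oldest to newest."""
    newest_first = time_sorted_slice(row_key(blog, post_title))
    return [
        (title, json.loads(value))
        for title, (value, _ts) in reversed(newest_first)
    ]


add_comment("example-blog", "Schemas", "Nice", "alice", "Good overview.", 10)
add_comment("example-blog", "Schemas", "Question", "bob", "What about tags?", 20)
add_comment("example-blog", "Slices", "Other", "carol", "Different post.", 15)

thread = comments_in_order("example-blog", "Schemas")
for title, comment in thread:
    print(title, comment["author"], comment["body"])
```

The reversal is a single pass over one post's comments, which is why doing it in memory is cheap.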