Message-ID: <4A844A4E.1020003@mollenhour.com>
Date: Thu, 13 Aug 2009 13:15:58 -0400
From: Colin Mollenhour
To: cassandra-user@incubator.apache.org
Subject: Re: Visual representation of Cassandra data model
References: <5676b0940908121657v248894eap9d112575d0596b7b@mail.gmail.com>
In-Reply-To: <5676b0940908121657v248894eap9d112575d0596b7b@mail.gmail.com>

I'm really glad that you all are working on this. Cassandra's data
model was, and still is, a steep learning curve for me to completely
digest, because of the various implications (unknown to Cassandra
newbies especially) that the data model has on performance and
usability. This also seems to be changing somewhat with the Thrift API
changes, so it would be really nice to have a "designing a Cassandra
schema for your application" guide.

In your model I don't think it is best to have a general "map" SC with
all of the relations in it, since there will be unnecessary
deserialization and network transfer of map data that you won't always
make use of. I think you should denormalize and use separate CFs for
the various mappings. Cassandra handles lots of keys better than large
SCs, from what I understand.

Here is my first stab at the data model you are working on:

Schema Legend:

<name> (SC|CF keyed on <key description>)
  <key>: {<column>: <value>, ...}
or
  <key>: [<CF>: {<column>: <value>, ...}, <CF>: {...}, ...]

Delicious Keyspace Schema:

user (CF keyed on nick)
  "mccv": {name: "Mark McBride", email: "email@address.com"}

bookmark (SC keyed on url with CFs for related users and related tags)
  "http://thesartorialist.blogspot.com": [
    details: {title: "The Sartorialist", other_meta_data: ...},
    users: {"mccv": null},
    tags: {"blog": null, "news": null}]
  (storing users here may be overkill, but it is reasonable that when
  retrieving a bookmark you will usually want the tags too)

bookmark_tag_users (CF keyed on bookmark|tag containing list of related
users)
  "http://thesartorialist.blogspot.com|blog": {"mccv": null, ...}
  "http://thesartorialist.blogspot.com|news": {"mccv": null, ...}

user_bookmark_tags (CF keyed on user|bookmark to look up a user's tags
for a bookmark, or all of a user's bookmarks and their tags (using
key_range))
  "mccv|http://thesartorialist.blogspot.com": {"blog": null, "news": null, ...}

tag_bookmarks (CF keyed on tag name to look up all bookmarks for a given
tag)
  "blog": {"http://thesartorialist.blogspot.com": "The Sartorialist", ...}
  "news": {"http://thesartorialist.blogspot.com": "The Sartorialist", ...}

user_tag_bookmarks (CF keyed on user|tag to look up all bookmarks for a
given tag and user, or just a given user (using key_range))
  "mccv|blog": {"http://thesartorialist.blogspot.com": "The Sartorialist", ...}
  "mccv|news": {"http://thesartorialist.blogspot.com": "The Sartorialist", ...}

I think a good approach to designing a Cassandra schema from scratch is
to make a list of the queries that you *know* you will need to be fast,
then look at your model attempts and see how well they fit while trying
to minimize overhead. Example:

- All bookmarks for a user
- All of a user's bookmarks for a tag
- All bookmarks for a tag
- All tags for a bookmark
- etc.

I would start with a highly denormalized schema that consists of only
simple CFs.

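To make the query-first approach concrete, here is a minimal sketch in
Python. It uses plain in-memory dicts as stand-ins for the CFs above; it
is not Cassandra's Thrift API, and add_bookmark, key_range and the
bookmarks_for_* helpers are hypothetical names invented for
illustration. The point is only that one logical write fans out to
every CF, so that each query in the list becomes a single key lookup or
key range.

```python
from collections import defaultdict

# Each "CF" is modeled as {row_key: {column_name: value}}.
# These are in-memory stand-ins, not Cassandra client calls.
user_bookmark_tags = defaultdict(dict)   # keyed on user|bookmark
tag_bookmarks = defaultdict(dict)        # keyed on tag
user_tag_bookmarks = defaultdict(dict)   # keyed on user|tag
bookmark_tag_users = defaultdict(dict)   # keyed on bookmark|tag


def add_bookmark(user, url, title, tags):
    """Hypothetical helper: one logical write updates every CF."""
    for tag in tags:
        user_bookmark_tags[f"{user}|{url}"][tag] = None
        tag_bookmarks[tag][url] = title
        user_tag_bookmarks[f"{user}|{tag}"][url] = title
        bookmark_tag_users[f"{url}|{tag}"][user] = None


def key_range(cf, prefix):
    """Stand-in for a key range scan: rows whose key starts with prefix."""
    return {k: cf[k] for k in sorted(cf) if k.startswith(prefix)}


def bookmarks_for_user(user):
    """All bookmarks for a user, with tags: range over user|* keys."""
    rows = key_range(user_bookmark_tags, f"{user}|")
    return {k.split("|", 1)[1]: sorted(cols) for k, cols in rows.items()}


def bookmarks_for_user_tag(user, tag):
    """All of a user's bookmarks for a tag: one row lookup."""
    return dict(user_tag_bookmarks.get(f"{user}|{tag}", {}))


def bookmarks_for_tag(tag):
    """All bookmarks for a tag: one row lookup."""
    return dict(tag_bookmarks.get(tag, {}))


add_bookmark("mccv", "http://thesartorialist.blogspot.com",
             "The Sartorialist", ["blog", "news"])
```

The cost of this layout is visible in add_bookmark: every new tag on a
bookmark means four writes, in exchange for reads that never have to
join or scan unrelated data.
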
My take on SCs is that if you know that every time you retrieve data
from one CF for a key you will also retrieve data for another CF with
the same key, then you should probably combine them in a SC, otherwise
they probably need to be in a separate simple CF (due to the entire SC
having to be deserialized in memory just to retrieve a slice). However
it seems like you can end up with lots of special purpose CFs used as
maps and I'm not sure at what point you would want to simply go with a
different database system with a richer querying capability.

I don't know much about Delicious, but it seems that using natural keys
is perfectly acceptable in this case. I'm sure this isn't the best
schema but it is an alternative approach. I'd really love to see how the
experts would model this in a production system.

Thanks,
Colin

Mark McBride wrote:
> While working on an updated data model wiki page I'm trying to put
> together a graphical representation of the data model. I threw this
> together based on Curt's goal of modeling delicious. The basic gist
> is descriptive data for tags, users, and bookmarks goes in the
> Description column family. The relationships between bookmarks, tags
> and users goes in the map supercolumn. I'm not sure this is how you
> would do it in production (I'm guessing at the very least you'd want
> separate supercolumns for bookmarks, tags and users), but it seems to
> be simple enough for a new user to digest, and covers all the bases of
> the data model (aside from ordering I guess). So two questions
>
> 1) did I get it right (I'm new to this as well)?
> 2) is this a useful representation?
>
> ---Mark