From: Matt Corgan
Date: Thu, 10 Sep 2009 20:57:36 -0400
Subject: SuperColumn vs range of Columns
To: cassandra-user@incubator.apache.org

Hi,

I've been watching some of the Cassandra presentation videos and looking through slides and the website, but I'm still missing the motivation behind
SuperColumns.

1) What is the difference between a super-column like:

homeAddress: {
    street: "1234 x street",
    city: "san francisco",
    zip: "94107",
}

and the BigTable or HBase style of concatenating nested keys together into something like:

homeAddress/street: "1234 x street",
homeAddress/city: "san francisco",
homeAddress/zip: "94107"

Wouldn't they be sorted the same way on disk and be similarly efficient for range queries? Is it that you avoid storing the string "homeAddress" redundantly? Maybe that really adds up if you're doing inbox search and storing billions of doc ids where the column name is several times the size of the doc id. It seems like BigTable/HBase could get a similar benefit by using prefix compression and omitting the timestamps.

2) Can SuperColumns only add one level of nesting beyond normal columns? That seems limiting considering BigTable and HBase can append an arbitrary number of nested keys together.

3) Can you update the columns in the row of a supercolumn without overwriting the whole row? For example, if a Facebook user sends his 10,000th message with the word Steelers in it, does that mean all 10,000 columns need to be overwritten (something like 100KB), or can a single column be squeezed into the front of a supercolumn? Similarly, can you read a fraction of a SuperColumn without pulling the whole thing to the client?

As far as I can tell, the only benefit of a SuperColumn over a bunch of Columns stored together is the savings you get by not storing the column name and timestamp over and over. What am I missing?

Thanks! (Maybe this could be added to an FAQ section on the project wiki.)

Matt
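
P.S. To make question 1 concrete, here is a minimal Python sketch of the comparison I have in mind. It is not Cassandra or HBase code and says nothing about how either system actually stores data; the names nested_row, flatten, and prefix_scan, and the extra workAddress entry, are made up purely for illustration. It only shows that concatenated keys sort next to each other, so one sub-group can be read with a prefix range scan, and how many bytes the repeated prefix costs.

```python
from bisect import bisect_left

# Hypothetical row in the "super column" shape: exactly one level of nesting.
nested_row = {
    "homeAddress": {"street": "1234 x street", "city": "san francisco", "zip": "94107"},
    "workAddress": {"city": "oakland"},  # invented second group, for contrast
}


def flatten(row, sep="/"):
    """Concatenate parent and child names into sorted (key, value) pairs."""
    return sorted(
        (f"{parent}{sep}{child}", value)
        for parent, children in row.items()
        for child, value in children.items()
    )


def prefix_scan(flat, prefix):
    """Return the pairs whose key starts with prefix, via one binary search."""
    keys = [key for key, _ in flat]
    start = bisect_left(keys, prefix)
    result = []
    for key, value in flat[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so the matching run has ended
        result.append((key, value))
    return result


flat_row = flatten(nested_row)
home = prefix_scan(flat_row, "homeAddress/")

# Bytes spent repeating the prefix in the flattened layout.
prefix_overhead = len("homeAddress/") * len(home)
```

In this toy model the scan returns the same three entries the nested form holds under homeAddress, and the only visible cost of flattening is the repeated prefix, which is the saving I am guessing SuperColumns provide.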