From: Matt Corgan <mcorgan@hotpads.com>
Date: Thu, 10 Sep 2009 20:57:36 -0400
Message-ID: <ee2c0d9b0909101757t37504bedj7506fa0a180c4fbd@mail.gmail.com>
Subject: SuperColumn vs range of Columns
To: cassandra-user@incubator.apache.org

Hi,
I've been watching some of the Cassandra presentation videos and
looking through slides and the website, but I'm still missing the
motivation behind SuperColumns.

1) What is the difference between a super-column like:

homeAddress: {
  street: "1234 x street",
  city: "san francisco",
  zip: "94107",
}

and the BigTable or HBase style of concatenating nested keys together
into something like:

homeAddress/street: "1234 x street",
homeAddress/city: "san francisco",
homeAddress/zip: "94107"

Wouldn't they be sorted the same way on disk and be similarly
efficient for range queries?  Is it that you avoid storing the string
"homeAddress" redundantly?  Maybe that really adds up if you're doing
inbox search and storing billions of doc ids where the column name is
several times the size of the doc id.  Seems like BigTable/HBase could
get a similar benefit by using prefix compression and omitting the
timestamps.
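
To make the comparison concrete, here is a toy Python sketch of what I
mean by the concatenated-key layout.  It uses plain dicts and a sorted
list, not the real Cassandra or HBase APIs; flatten, prefix_scan, and
the workAddress entry are names I made up for illustration.

```python
from bisect import bisect_left


def flatten(nested, sep="/"):
    """Flatten {super_name: {col: value}} into sorted (key, value) pairs."""
    return sorted(
        (f"{super_name}{sep}{col}", value)
        for super_name, cols in nested.items()
        for col, value in cols.items()
    )


def prefix_scan(sorted_pairs, prefix):
    """Return the contiguous run of pairs whose key starts with prefix."""
    keys = [k for k, _ in sorted_pairs]
    start = bisect_left(keys, prefix)  # first key >= prefix
    end = start
    while end < len(keys) and keys[end].startswith(prefix):
        end += 1
    return sorted_pairs[start:end]


# Hypothetical row: the nested form from the first example above.
row = {
    "homeAddress": {
        "street": "1234 x street",
        "city": "san francisco",
        "zip": "94107",
    },
    "workAddress": {"city": "oakland"},  # made-up second group
}

flat = flatten(row)
home = prefix_scan(flat, "homeAddress/")

# Bytes spent repeating the shared prefix in the flattened layout.
redundant_bytes = len("homeAddress/") * len(home)
```

In this toy model the three homeAddress entries come back as one
contiguous slice, and the only cost I can see is the 12-byte prefix
repeated once per column.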


2) Can SuperColumns only add one level of nesting beyond normal
columns? That seems limiting considering BigTable and HBase can append
an arbitrary number of nested keys together.


3) Can you update the columns within a SuperColumn without
overwriting the whole SuperColumn? For example, if a Facebook user
sends his 10,000th message with the word Steelers in it, does that mean
all 10,000 columns need to be overwritten (something like 100KB), or
can a single column be squeezed into the front of a SuperColumn?
Similarly, can you read a fraction of a SuperColumn without pulling the
whole thing to the client?
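
The arithmetic behind that 100KB figure, as a rough Python sketch.  The
10 bytes per column is my own assumption (100KB spread over 10,000
columns), and bytes_written is a made-up helper, not anything from
Cassandra.

```python
def bytes_written(existing_columns, bytes_per_column, rewrite_whole):
    """Bytes written when one column is added to a SuperColumn.

    rewrite_whole=True models rewriting every column;
    rewrite_whole=False models writing only the new column.
    """
    if rewrite_whole:
        return (existing_columns + 1) * bytes_per_column
    return bytes_per_column


# Assumed size: roughly 10 bytes per column.
full_rewrite = bytes_written(9_999, 10, rewrite_whole=True)
single_insert = bytes_written(9_999, 10, rewrite_whole=False)
ratio = full_rewrite // single_insert
```

Under those assumptions the full rewrite costs 10,000 times as much as
the single-column write, which is why the answer matters for a workload
like inbox search.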

As far as I can tell, the only benefit of a SuperColumn over a bunch
of Columns stored together is the savings you get by not storing the
column name and timestamp over and over.  What am I missing?

Thanks!  (maybe this could be added to an FAQ section on the project wiki)

Matt