Mailing-List: contact cassandra-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: cassandra-user@incubator.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Date: Thu, 26 Nov 2009 10:12:15 -0800
From: Anthony Molinaro <anthonym@alumni.caltech.edu>
To: cassandra-user@incubator.apache.org
Subject: Re: Modeling question
Message-ID: <20091126181215.GA65552@alumni.caltech.edu>
Mail-Followup-To: cassandra-user@incubator.apache.org
References: <828083e70911260052r62544f64rabac24914ae857f0@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <828083e70911260052r62544f64rabac24914ae857f0@mail.gmail.com>
User-Agent: Mutt/1.4.2.3i


On Thu, Nov 26, 2009 at 09:52:30AM +0100, gabriele renzi wrote:
> in my team, we are considering using cassandra for our project in
> place of a (pseudo)relational solution, but I am not sure on how we
> should handle a couple of modeling issues.
> Basically, my problem is how to bring into cassandra a db where
> elements are in the form <primary, secondary, attached data..> with
> the pair <primary, secondary> is unique (think <node, node, arc
> weight>) and we mostly do queries in the form
>    select secondary, data from db where primary= x  ---- perfect fit
> for cassandra
> 
> and in batch jobs we want to rewrite the whole thing _but_ using the
> other key for lookup, akin to
>  1. insert or update (primary, secondary, data) values (.. .. .. )  --
> we can do this using the primary lookup
>  2. delete from db where secondary = x and not in just inserted -- how
> do we do this?
> 
> it is my understanding that cassandra does not support secondary
> indexes so we would have to do a full scan to perform the #2
> operation, or we should mantain the second index by ourselves indexed
> on secondary and containing references to the primary.

Unless you are using order-preserving partitioning, which might or might
not be what you want, you won't be able to do a full scan.  Instead, you
should probably have two column families, one keyed by primary and one by
secondary, each with a column for the other; then you can do your
operations.  It uses more space, but disk is cheap, so that's probably not
a big deal.  If you have to model a many-to-many relationship, you can use
super columns.

So I would imagine two super column families like

Primary Super Column Family
{ '<primary_id_0>' => { '<secondary_id_0>' => { 'data' => "<data_0>" },
                        '<secondary_id_1>' => { 'data' => "<data_1>" } }
}

Secondary Super Column Family
{ '<secondary_id_0>' => { '<primary_id_0>' => { 'data' => "<data_0>" } },
  '<secondary_id_1>' => { '<primary_id_0>' => { 'data' => "<data_1>" } }
}

You do your inserts into both, and for deletes you do a get_slice on the
row for the secondary id, which will give you all primary ids that contain
the secondary id.  Then you can delete the ones you did not just insert,
removing each pair from both column families.
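
To make the insert/delete flow concrete, here is a rough sketch in Python.
It only simulates the two super column families with in-memory dicts; it is
not Cassandra client code, and the names (by_primary, by_secondary, insert,
slice_by_secondary, delete_stale) are made up for illustration.

```python
from collections import defaultdict

# In-memory stand-ins for the two super column families (NOT a Cassandra client).
# Layout: row key -> super column name -> {column name: value}
by_primary = defaultdict(dict)
by_secondary = defaultdict(dict)


def insert(primary, secondary, data):
    """Write the same pair into both families so either key can be looked up."""
    by_primary[primary][secondary] = {"data": data}
    by_secondary[secondary][primary] = {"data": data}


def slice_by_secondary(secondary):
    """Rough analogue of a get_slice on the secondary row: the primary ids under it."""
    return set(by_secondary.get(secondary, {}))


def delete_stale(secondary, just_inserted):
    """Delete every (primary, secondary) pair whose primary is not in just_inserted."""
    stale = slice_by_secondary(secondary) - set(just_inserted)
    for primary in stale:
        # Remove the pair from both families to keep them consistent.
        by_primary[primary].pop(secondary, None)
        by_secondary[secondary].pop(primary, None)
    return stale


insert("p0", "s0", "old")
insert("p1", "s0", "old")
insert("p1", "s0", "new")  # the batch job rewrites this pair
removed = delete_stale("s0", just_inserted={"p1"})
```

In a real cluster the two writes are separate operations, so the batch job
has to tolerate one family being briefly ahead of the other.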

HTH,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@alumni.caltech.edu>