incubator-cassandra-user mailing list archives

From Anthony Molinaro <antho...@alumni.caltech.edu>
Subject Re: Modeling question
Date Thu, 26 Nov 2009 18:12:15 GMT

On Thu, Nov 26, 2009 at 09:52:30AM +0100, gabriele renzi wrote:
> in my team, we are considering using cassandra for our project in
> place of a (pseudo)relational solution, but I am not sure how we
> should handle a couple of modeling issues.
> Basically, my problem is how to bring into cassandra a db whose
> elements are in the form <primary, secondary, attached data..>, where
> the pair <primary, secondary> is unique (think <node, node, arc
> weight>), and we mostly do queries in the form
>    select secondary, data from db where primary = x  ---- perfect fit for cassandra
> 
> and in batch jobs we want to rewrite the whole thing _but_ using the
> other key for lookup, akin to
>  1. insert or update (primary, secondary, data) values (.. .. .. )  -- we can do this using the primary lookup
>  2. delete from db where secondary = x and not in just inserted -- how do we do this?
> 
> it is my understanding that cassandra does not support secondary
> indexes, so we would have to do a full scan to perform the #2
> operation, or we would have to maintain the second index ourselves, keyed
> on secondary and containing references to the primary.

Unless you are using order-preserving partitioning, which might or might not
be what you want, you won't be able to do a full scan.  Instead you should
probably have two column families, one keyed by primary and one by secondary,
each with a column for the other; then you can do your operations.  It
uses more space, but disk is cheap, so that's probably not a big deal.  If you
have to model a many-to-many relationship you can use super columns.

So I would imagine 2 super column families like

Primary Super Column Family
{ '<primary_id_0>' => { '<secondary_id_0>' => { 'data' => "<data_0>" },
                        '<secondary_id_1>' => { 'data' => "<data_1>" } } }

Secondary Super Column Family
{ '<secondary_id_0>' => { '<primary_id_0>' => { 'data' => "<data_0>" } },
  '<secondary_id_1>' => { '<primary_id_0>' => { 'data' => "<data_1>" } } }

You do your inserts into both, and for deletes you do a get_slice on the
secondary id, which will give you all primary ids that contain the
secondary id.  Then you can delete whichever of those you did not just
insert, removing each from both column families.
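
To make the double-write and the delete step concrete, here is a minimal
sketch in Python.  It does not talk to Cassandra at all: the two super column
families are modeled as plain nested dicts, and the names by_primary,
by_secondary, insert and delete_stale are made up for illustration.  The
lookup inside delete_stale only stands in for what a get_slice on the
secondary row would return.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the two super column families.
# Layout: row key -> super column name -> {'data': value}
by_primary = defaultdict(dict)
by_secondary = defaultdict(dict)


def insert(primary, secondary, data):
    """Write the same record into both column families."""
    by_primary[primary][secondary] = {'data': data}
    by_secondary[secondary][primary] = {'data': data}


def delete_stale(secondary, just_inserted):
    """Delete every pair under `secondary` whose primary is not in `just_inserted`.

    Returns the list of primary ids that were removed.
    """
    # Stand-in for the get_slice on the secondary row: every primary id
    # currently stored under this secondary id.
    stale = [p for p in by_secondary.get(secondary, {}) if p not in just_inserted]
    for p in stale:
        # Remove the pair from both sides so the two views stay in sync.
        del by_secondary[secondary][p]
        by_primary[p].pop(secondary, None)
    return stale


# Batch job: three arcs into 'x' exist, but only 'a' and 'c' were rewritten.
insert('a', 'x', 1)
insert('b', 'x', 2)
insert('c', 'x', 3)
removed = delete_stale('x', {'a', 'c'})
```

In a real cluster the two writes are separate operations, so the two column
families can briefly disagree; the sketch ignores that.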

HTH,

-Anthony

-- 
------------------------------------------------------------------------
Anthony Molinaro                           <anthonym@alumni.caltech.edu>
