incubator-cassandra-user mailing list archives

From Stuart Langridge <stuart.langri...@canonical.com>
Subject Understanding atomicity in Cassandra
Date Thu, 08 Jul 2010 16:13:46 GMT
Hi, Cassandra people!

We're looking at Cassandra as a possible replacement for some parts of
our database structures, and on an early look I'm a bit confused about
atomicity guarantees and rollbacks and such, so I wanted to ask what
standard practice is for dealing with the sort of situation I outline
below.

Imagine that we're storing information about files. Each file has a path
and a uuid, and sometimes we need to look up stuff about a file by its
path and sometimes by its uuid. The best way to do this, as I understand
it, is to store the data in Cassandra twice: once indexed by uuid and
once by path. So, I have two ColumnFamilies, one indexed by uuid:

{
  "some-uuid-1": {
    "path": "/a/b/c",
    "size": 100000
  },
  "some-uuid-2": {
    ...
  },
  ...
}

and one indexed by path

{
  "/a/b/c": {
    "uuid": "some-uuid-1",
    "size": 100000
  },
  "/d/e/f": {
    ...
  },
  ...
}

So, first, do please correct me if I've misunderstood the terminology
here (and I've shown a "short form" of ColumnFamily here, as per
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).

The thing I don't quite get is: what happens when I want to add a new
file? I need to add it to both these ColumnFamilies, but there's no "add
it to both" atomic operation. What's the way that people handle the
situation where I add to the first CF and then my program crashes, so I
never added to the second? (Assume that there is lots more data than
I've outlined above, so that "put it all in one SuperColumnFamily,
because that can be updated atomically" won't work because it would end
up with our entire database in one SCF). Should we add to one, and then,
if we fail to add to the other for some reason, continually retry until
it works? Or have a "garbage collection" procedure, run from cron, which
finds discrepancies between indexes like this and fixes them up? We'd
love to hear some advice on how to do this, or if we're
modelling the data in the wrong way and there's a better way which
avoids these problems!
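
To make the two options concrete, here is a rough sketch in Python of
what I have in mind. Plain dicts stand in for the two ColumnFamilies
(no real Cassandra client is involved), and every name in it
(files_by_uuid, files_by_path, insert_with_retry, add_file,
repair_path_index) is made up for illustration. It also assumes the
uuid-indexed CF is the source of truth, which may not be the right
choice.

```python
import time

# Plain dicts stand in for the two ColumnFamilies (hypothetical names).
files_by_uuid = {}
files_by_path = {}


def insert_with_retry(store, key, columns, attempts=5, delay=0.1):
    """Write one row, retrying with exponential backoff on failure.

    Writing the same columns to the same key twice leaves the same
    result, so retrying is safe. A dict write never fails; a real
    client call would go where the assignment is.
    """
    for attempt in range(attempts):
        try:
            store[key] = dict(columns)
            return True
        except Exception:
            time.sleep(delay * (2 ** attempt))
    return False


def add_file(uuid, path, size):
    """Option 1: write the uuid row first, then the path row."""
    if not insert_with_retry(files_by_uuid, uuid, {"path": path, "size": size}):
        return False
    # A crash here leaves the path index without this file.
    return insert_with_retry(files_by_path, path, {"uuid": uuid, "size": size})


def repair_path_index():
    """Option 2: the "garbage collection" pass, e.g. run from cron.

    Treats files_by_uuid as the source of truth (an assumption) and
    rewrites any path row that is missing or does not match it.
    Returns the number of rows fixed.
    """
    fixed = 0
    for uuid, columns in files_by_uuid.items():
        expected = {"uuid": uuid, "size": columns["size"]}
        if files_by_path.get(columns["path"]) != expected:
            files_by_path[columns["path"]] = expected
            fixed += 1
    return fixed
```

The repair pass here only goes one way: it would not notice a path row
whose uuid row never got written, so the write order in add_file
matters.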

sil


