From: Jonathan Ellis <jbellis@gmail.com>
Date: Fri, 9 Jul 2010 13:50:45 -0500
Message-ID: <AANLkTimXogiTyiWPnHd_1aQAmPgLSef7ByT0RBiXLlCf@mail.gmail.com>
Subject: Re: Understanding atomicity in Cassandra
To: user@cassandra.apache.org

Typically you will update both as part of a single batch_mutate, and if
it fails, retry the operation.  Rewriting any part that already
succeeded is harmless.
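
A minimal sketch of that retry pattern, for illustration only. It
assumes an in-memory stand-in (FlakyStore) rather than a real client
library, and the column family names FilesByUuid and FilesByPath, the
add_file helper, and the timestamp handling are all hypothetical. The
point it shows is that replaying the whole batch with the same
timestamp leaves the already-written rows unchanged:

```python
class FlakyStore:
    """Hypothetical in-memory stand-in for a cluster; not a real client API."""

    def __init__(self, fail_times=0):
        # column family -> row key -> column name -> (value, timestamp)
        self.cfs = {}
        self.fail_times = fail_times

    def batch_mutate(self, mutations, timestamp):
        """Apply (column_family, row_key, columns) mutations in order.

        While fail_times > 0, fail before the second mutation to simulate
        a batch that was only partially applied.
        """
        for i, (cf, key, columns) in enumerate(mutations):
            if self.fail_times > 0 and i == 1:
                self.fail_times -= 1
                raise IOError("simulated failure after a partial write")
            row = self.cfs.setdefault(cf, {}).setdefault(key, {})
            for name, value in columns.items():
                current = row.get(name)
                # Newest timestamp wins, so replaying the same write
                # stores the same value again and changes nothing.
                if current is None or timestamp >= current[1]:
                    row[name] = (value, timestamp)


def add_file(store, uuid, path, size, timestamp, max_attempts=5):
    """Write both index rows in one batch; retry the whole batch on failure."""
    mutations = [
        ("FilesByUuid", uuid, {"path": path, "size": size}),
        ("FilesByPath", path, {"uuid": uuid, "size": size}),
    ]
    for attempt in range(max_attempts):
        try:
            store.batch_mutate(mutations, timestamp)
            return attempt + 1  # number of attempts it took
        except IOError:
            continue  # safe to resend everything, including what succeeded
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

For example, add_file(FlakyStore(fail_times=2), "some-uuid-1", "/a/b/c",
100000, timestamp=1) fails twice after writing only the uuid row, then
succeeds on the third attempt with both rows present. Reusing one
timestamp across retries is a design choice in this sketch: it keeps a
late retry from overwriting a newer update to the same file.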

On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
<stuart.langridge@canonical.com> wrote:
> Hi, Cassandra people!
>
> We're looking at Cassandra as a possible replacement for some parts of
> our database structures, and on an early look I'm a bit confused about
> atomicity guarantees and rollbacks and such, so I wanted to ask what
> standard practice is for dealing with the sorts of situation I outline
> below.
>
> Imagine that we're storing information about files. Each file has a path
> and a uuid, and sometimes we need to look up stuff about a file by its
> path and sometimes by its uuid. The best way to do this, as I understand
> it, is to store the data in Cassandra twice: once indexed by uuid and
> once by path. So, I have two ColumnFamilies, one indexed by uuid:
>
> {
>  "some-uuid-1": {
>    "path": "/a/b/c",
>    "size": 100000
>  },
>  "some-uuid-2": {
>    ...
>  },
>  ...
> }
>
> and one indexed by path
>
> {
>  "/a/b/c": {
>    "uuid": "some-uuid-1",
>    "size": 100000
>  },
>  "/d/e/f": {
>    ...
>  },
>  ...
> }
>
> So, first, do please correct me if I've misunderstood the terminology
> here (and I've shown a "short form" of ColumnFamily here, as per
> http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>
> The thing I don't quite get is: what happens when I want to add a new
> file? I need to add it to both these ColumnFamilies, but there's no "add
> it to both" atomic operation. What's the way that people handle the
> situation where I add to the first CF and then my program crashes, so I
> never added to the second? (Assume that there is lots more data than
> I've outlined above, so that "put it all in one SuperColumnFamily,
> because that can be updated atomically" won't work because it would end
> up with our entire database in one SCF). Should we add to one and then,
> if we fail to add to the other for some reason, continually retry until
> it works? Or have a "garbage collection" procedure which finds
> discrepancies between indexes like this and fixes them up, and run it
> from cron? We'd love to hear some advice on how to do this, or if we're
> modelling the data in the wrong way and there's a better way which
> avoids these problems!
>
> sil


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com