From: Jonathan Ellis
Date: Fri, 9 Jul 2010 13:50:45 -0500
Subject: Re: Understanding atomicity in Cassandra
To: user@cassandra.apache.org

typically you will update both as part of a batch_mutate, and if it
fails, retry the operation. re-writing any part that succeeded will
be harmless.

On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge wrote:
> Hi, Cassandra people!
>
> We're looking at Cassandra as a possible replacement for some parts of
> our database structures, and on an early look I'm a bit confused about
> atomicity guarantees and rollbacks and such, so I wanted to ask what
> standard practice is for dealing with the sorts of situation I outline
> below.
>
> Imagine that we're storing information about files. Each file has a path
> and a uuid, and sometimes we need to look up stuff about a file by its
> path and sometimes by its uuid. The best way to do this, as I understand
> it, is to store the data in Cassandra twice: once indexed by uuid and
> once by path. So, I have two ColumnFamilies, one indexed by uuid:
>
> {
>  "some-uuid-1": {
>    "path": "/a/b/c",
>    "size": 100000
>  },
>  "some-uuid-2": {
>    ...
>  },
>  ...
> }
>
> and one indexed by path
>
> {
>  "/a/b/c": {
>    "uuid": "some-uuid-1",
>    "size": 100000
>  },
>  "/d/e/f": {
>    ...
>  },
>  ...
> }
>
> So, first, do please correct me if I've misunderstood the terminology
> here (and I've shown a "short form" of ColumnFamily here, as per
> http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>
> The thing I don't quite get is: what happens when I want to add a new
> file? I need to add it to both these ColumnFamilies, but there's no "add
> it to both" atomic operation. What's the way that people handle the
> situation where I add to the first CF and then my program crashes, so I
> never added to the second? (Assume that there is lots more data than
> I've outlined above, so that "put it all in one SuperColumnFamily,
> because that can be updated atomically" won't work because it would end
> up with our entire database in one SCF). Should we add to one, and then
> if we fail to add to the other for some reason continually retry until
> it works? Have a "garbage collection" procedure which finds
> discrepancies between indexes like this and fixes them up and run it
> from cron? We'd love to hear some advice on how to do this, or if we're
> modelling the data in the wrong way and there's a better way which
> avoids these problems!
>
> sil
>
>
>

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
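
A rough sketch of the retry approach described in the reply, for illustration only. It does not use a real Cassandra client: FlakyStore is a hypothetical in-memory stand-in for the two ColumnFamilies, its batch_mutate method only borrows the name of the Thrift call, and add_file, fail_after_writes and max_attempts are made-up names. The point it tries to show is that a retry rewrites the same keys with the same values, so repeating the part that already succeeded changes nothing.

```python
class FlakyStore:
    """Hypothetical in-memory stand-in for two ColumnFamilies (not a Cassandra client)."""

    def __init__(self, fail_after_writes=None):
        self.by_uuid = {}   # row key: uuid
        self.by_path = {}   # row key: path
        # Assumption for the demo: fail once after this many successful writes.
        self._writes_left = fail_after_writes

    def _write(self, cf, key, columns):
        if self._writes_left is not None:
            if self._writes_left == 0:
                self._writes_left = None  # fail a single time, then recover
                raise IOError("simulated failure mid-batch")
            self._writes_left -= 1
        # Writing identical columns again simply overwrites them with the same values.
        cf.setdefault(key, {}).update(columns)

    def batch_mutate(self, file_uuid, path, size):
        # Both rows are sent together, but the batch can still fail part-way.
        self._write(self.by_uuid, file_uuid, {"path": path, "size": size})
        self._write(self.by_path, path, {"uuid": file_uuid, "size": size})


def add_file(store, file_uuid, path, size, max_attempts=5):
    """Retry the whole batch until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            store.batch_mutate(file_uuid, path, size)
            return attempt
        except IOError:
            continue  # safe to retry: the rewrite of the first row is harmless
    raise RuntimeError("gave up after %d attempts" % max_attempts)


store = FlakyStore(fail_after_writes=1)  # first row lands, second write fails once
attempts = add_file(store, "some-uuid-1", "/a/b/c", 100000)
```

Note that the uuid has to be chosen once, before the first attempt, so that every retry writes exactly the same rows. Generating a fresh uuid inside the retry loop would leave an orphaned row behind after a partial failure.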