cassandra-user mailing list archives

From Patricio Echagüe <patric...@gmail.com>
Subject Re: Understanding atomicity in Cassandra
Date Tue, 20 Jul 2010 20:08:00 GMT
Hi, regarding the retry strategy, I understand that it makes sense
as long as the client is actually able to perform the retry.
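
To make sure I understand the retry idea, here is a minimal Python sketch.
It uses a toy in-memory store, not a real Cassandra client: FlakyStore,
apply_batch, apply_with_retry and the CF names by_uuid / by_path are all
made up for illustration, and the "newest timestamp wins" rule is my
simplified assumption of how column writes are reconciled. The point is
only that replaying the same batch with the same timestamp is harmless.

```python
class FlakyStore:
    """Toy in-memory stand-in for a column store (hypothetical, not Cassandra)."""

    def __init__(self, fail_after=None):
        self.rows = {}                # (cf, key) -> {column: (value, timestamp)}
        self.fail_after = fail_after  # successful writes allowed before one simulated failure
        self.writes = 0

    def write(self, cf, key, column, value, ts):
        if self.fail_after is not None and self.writes >= self.fail_after:
            self.fail_after = None    # fail exactly once
            raise IOError("simulated failure")
        self.writes += 1
        row = self.rows.setdefault((cf, key), {})
        current = row.get(column)
        # Assumed rule: a write only replaces a column if its timestamp is not older.
        if current is None or ts >= current[1]:
            row[column] = (value, ts)


def apply_batch(store, batch, ts):
    """Write every (cf, key, column, value) entry using one shared timestamp."""
    for cf, key, column, value in batch:
        store.write(cf, key, column, value, ts)


def apply_with_retry(store, batch, ts, attempts=3):
    """Replay the whole batch until it goes through; returns the number of tries."""
    for attempt in range(attempts):
        try:
            apply_batch(store, batch, ts)
            return attempt + 1
        except IOError:
            continue  # rewriting the part that already succeeded changes nothing
    raise IOError("batch failed after %d attempts" % attempts)
```

This only works while the client is alive to call apply_with_retry again,
which is exactly the part I am worried about.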

We are trying to build a fault-tolerant solution based on Cassandra.
In some scenarios, the client machine can go down in the middle of a
transaction, so there may be nobody left to retry.

Would it be bad design to store all the data that needs to be
consistent under one big key? The batch_mutate operations would stay
small, since only a small part is updated or added at a time, and
at least we would know that each operation either succeeded or failed
as a whole.
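
What I have in mind can be sketched roughly like this in Python. This is
only a toy model under my assumption that mutations against a single key
are applied as a unit (that is how I read the FAQ entry linked further
down in this thread); apply_row_mutation, fail_on and the sample super
column names are hypothetical and not part of any Cassandra API.

```python
import copy


def apply_row_mutation(rows, key, updates, fail_on=None):
    """Apply all `updates` (super column -> columns) to one row, or none of them.

    `fail_on` names a super column at which a crash is simulated.
    """
    # Stage the changes on a private copy of the row.
    staged = copy.deepcopy(rows.get(key, {}))
    for super_column, columns in updates.items():
        if super_column == fail_on:
            raise IOError("simulated crash mid-mutation")
        staged.setdefault(super_column, {}).update(columns)
    # Single swap: a reader sees either the old row or the complete new row.
    rows[key] = staged
```

With everything for one user under one key, a client crash would leave
the row either fully updated or untouched, with nothing to repair later.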

We basically have:

CF: usernames (similar to Twitter model)
SCF: User_tree (it has all the information related to the user)

Thanks

On Mon, Jul 19, 2010 at 9:40 PM, Alex Yiu <bigcontentflow@gmail.com> wrote:
>
> Hi, Stuart,
> If I may paraphrase what Jonathan said, typically your batch_mutate
> operation is idempotent.
> That is, you can replay / retry the same operation within a short timeframe
> without any undesirable side effect.
> The assumption behind the "short timeframe" is that no other concurrent
> writer is trying to write anything conflicting in an interleaving
> fashion.
> Imagine that there was another writer trying to write:
>>  "some-uuid-1": {
>>    "path": "/foo/bar",
>>    "size": 100000
>>  },
> ...
>> {
>>  "/foo/bar": {
>>    "uuid": "some-uuid-1"
>>  },
> Then there is a chance that the 4 write operations (two writes for
> "/a/b/c" into 2 CFs and two writes for "/foo/bar" into 2 CFs) would
> interleave with each other and produce an undesirable result.
> I guess that is not a likely situation in your case.
> Hopefully, my email helps.
> See also:
> http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic
>
> Regards,
> Alex Yiu
>
>
> On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>
>> typically you will update both as part of a batch_mutate, and if it
>> fails, retry the operation.  re-writing any part that succeeded will
>> be harmless.
>>
>> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
>> <stuart.langridge@canonical.com> wrote:
>> > Hi, Cassandra people!
>> >
>> > We're looking at Cassandra as a possible replacement for some parts of
>> > our database structures, and on an early look I'm a bit confused about
>> > atomicity guarantees and rollbacks and such, so I wanted to ask what
>> > standard practice is for dealing with the sorts of situation I outline
>> > below.
>> >
>> > Imagine that we're storing information about files. Each file has a path
>> > and a uuid, and sometimes we need to look up stuff about a file by its
>> > path and sometimes by its uuid. The best way to do this, as I understand
>> > it, is to store the data in Cassandra twice: once indexed by uuid and
>> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
>> >
>> > {
>> >  "some-uuid-1": {
>> >    "path": "/a/b/c",
>> >    "size": 100000
>> >  },
>> >  "some-uuid-2": {
>> >    ...
>> >  },
>> >  ...
>> > }
>> >
>> > and one indexed by path
>> >
>> > {
>> >  "/a/b/c": {
>> >    "uuid": "some-uuid-1",
>> >    "size": 100000
>> >  },
>> >  "/d/e/f": {
>> >    ...
>> >  },
>> >  ...
>> > }
>> >
>> > So, first, do please correct me if I've misunderstood the terminology
>> > here (and I've shown a "short form" of ColumnFamily here, as per
>> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>> >
>> > The thing I don't quite get is: what happens when I want to add a new
>> > file? I need to add it to both these ColumnFamilies, but there's no "add
>> > it to both" atomic operation. What's the way that people handle the
>> > situation where I add to the first CF and then my program crashes, so I
>> > never added to the second? (Assume that there is lots more data than
>> > I've outlined above, so that "put it all in one SuperColumnFamily,
>> > because that can be updated atomically" won't work because it would end
>> > up with our entire database in one SCF). Should we add to one, and then
>> > if we fail to add to the other for some reason continually retry until
>> > it works? Have a "garbage collection" procedure which finds
>> > discrepancies between indexes like this and fixes them up and run it
>> > from cron? We'd love to hear some advice on how to do this, or if we're
>> > modelling the data in the wrong way and there's a better way which
>> > avoids these problems!
>> >
>> > sil
>> >
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>



-- 
Patricio.-
