incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David King <>
Subject Maintaining secondary indices on mutable columns
Date Fri, 16 Apr 2010 22:01:27 GMT
I'm having trouble visualising how to maintain a secondary index on a mutable column. For instance,
given some objects and the number that we have in inventory:

widget1: { count = 0, colour = blue }
widget2: { count = 5, colour = red }
widget3: { count = 8, colour = green }
widget4: { count = 8, colour = purple }
widget5: { count = 9, colour = green }
widget6: { count = 2, colour = red }
widget7: { count = 1, colour = blue }

If i want a list of widgets ordered by the number that we have, I might construct an ordered
columnfamily that looks like:

count009_widget5: { id = widget5 }
count008_widget4: { id = widget4 }
count008_widget3: { id = widget3 }
count005_widget2: { id = widget2 }
count002_widget6: { id = widget6 }
count001_widget7: { id = widget7 }
count000_widget1: { id = widget1 }

And do slices against that CF. (The additional _widget5 bit of the key is just so that duplicates
don't clobber each other.) This works great if the inventory-count is immutable with respect
to the ID (like a timestamp, as in the Twissandra example, which does exactly this), but if
it's not then whenever we update the count, we have to also update the secondary index CF.
Let's say we want to increment the inventory count of widget2. This presents some problems:

1. We're going to have to remove the old count005_widget2 row (action A) and add a count006_widget2
row (action B). If we do A first, then B, then for some period widget2 doesn't show up in
our inventory report at all. If we do B first then A, then it shows it twice for a bit. If
this is a heavily-trafficed CF, then at any given time there are likely to be many such inconsistencies
adding up and querying a slice is unlikely to be useful information at all.

2. How do we find the old column name "count005_widget2" to remove it? If it's a trafficked
column we can't use the current count before the increment because that's rife for race-conditions,
and if it's a big CF we can't scan the whole thing looking for that ID. We could devise a
hack by guessing that the inventory is unlikely to change by more than N% at a time, and pull
the slice count004*.. count010* and only scan that slice for the deletions, but that imposes
a new race-condition where we have simultaneous agents incrementing the count, decrementing
it, and incrementing it (that is, the decrement may delete the row that the second increment
created, and now none of them are in the index).

Has anybody devised a toolkit or guidelines for working with this situation?
View raw message