cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steven A Robenalt <>
Subject Re: Handling data consistency
Date Tue, 25 Feb 2014 16:21:07 GMT
Hi Kasper,

Given the 2 most common queries you mention (top 5 and 5 nearest), I will
assume that these are relative to the current user, so basically, they
would correspond to the top 5 among my friends, and the 5 closest to my own
top score. If that is your intention, I would suggest that you could
compute both of these once at the time a write occurs (i.e. any new high
score) and store them separate from the main friends list (which would
still be updated as I suggested before). It wasn't clear to me if you're
using CQL or not, but if so, your user record could have 2 collection
columns for the top 5 and nearest 5, and I'm sure it'd be pretty easy to
accomplish the same thing with Thrift as well, though I don't have
experience in that area.

The top 5 list would be easy to update because each new high score would
potentially displace one value. The nearest 5 would be likely to be a new
set each time, so you'd just replace the entire column. This strategy does
put more load onto your writes, but would be very efficient to read - one
column for each set.

Other queries, such as scrolling through the list, could still be handled
from the full list with sorting, as I suggested before.

So if my assumptions above are valid, this still seems to me to be better
than using the score as part of the key and forcing read-before-write.

YMMV, of course...


On Tue, Feb 25, 2014 at 1:11 AM, Kasper Middelboe Petersen <> wrote:

> Hi Steve,
> I've considered this approach before and I'm partial to going this way
> again.
> The reason I haven't yet is the fact that I'm fairly confident that the
> user patterns would have a lot more highscore list reads than setting of
> highscores or adding friends. And most of the reads from highscores is
> either the 5 closest scores to a given score or the top-5. If a user has
> 100 friends it seems very excessive to have to read all these 100 entries
> to get the top 5. With the score as part of the clustering key this is
> possible to do efficiently.
> Do you have any thoughts on this?
> /Kasper
> On Tue, Feb 25, 2014 at 1:30 AM, Steven A Robenalt <>wrote:
>> Hi Kasper,
>> I am assuming that your friend list is symmetric (i.e. If I am your
>> friend then you are also my friend), which your comments seem to indicate.
>> First, I would suggest that you drop the friends score as a part of the
>> clustering key, which eliminates the need to read-before-write.
>> With that in mind, here's what I'd recommend:
>> 1) Each user has their own high score and timestamp as part of their own
>> attributes (thus keyed to their own user id).
>> 2) Each user also has a list of their own friends, keyed to their own
>> user id and the friend's user_id, and including the friend's high score and
>> updated time as attributes.
>> 3) When I add a friend, I copy their high score and timestamp to my own
>> friends list, and I copy my current high score and timestamp to their
>> friends list.
>> 4) When I update my own high score, I grab the ids of all of my friends
>> from my own friends list and push my new high score and timestamp to each
>> of them.
>> 5) When I remove a friend, I delete myself from their friends list and
>> them from mine.
>> 6) When I view any friends list, the result must be sorted at that time,
>> presumably by descending high score (since the score is no longer part of
>> the clustering key).
>> There's a couple of potential conflicts in this strategy, which may occur
>> (for example) if my friend updates a high score at the same time I drop
>> them as a friend (they would end up re-inserting themselves to my friend
>> list after I dropped them). In this case, I'd need to simply drop them
>> again. This may or may not be acceptable in your application. Most conflict
>> situations could be resolved with use of lightweight transactions if you
>> want to sacrifice performance (and assuming that the condition happens
>> often enough to warrant such treatment).
>> For performance reasons, it may also be desirable to lump all of the high
>> score updates for a user in a batch statement if you have the option to do
>> so.
>> Anyway, that eliminates the read-before-write problem, and also insures
>> that if a high score is missed at the time a new user is added as a friend,
>> it can at least be updated later with the correct value. Not sure how much
>> help that is, but maybe it'll give you some ideas you can experiment with.
>> Steve
>> On Mon, Feb 24, 2014 at 9:47 AM, Kasper Middelboe Petersen <
>>> wrote:
>>> Hi,
>>> My requirements include a system that can handle friend based highscore
>>> lists (as a user I have a bunch of friends from various social sites like
>>> Facebook). The user must have a highscore list that consist of his friends
>>> only.
>>> I have implemented this using the users ID as partition key and the
>>> friends score and id as clustering keys. This keeps reads of the highscore
>>> list fast and straight forward.
>>> Updating a highscore is a bit cumbersome and suffers from
>>> read-before-write, but can largely be done without too much worry. The big
>>> issue here is I need to know the exact old highscore to be able to set a
>>> new one.
>>> The big problem arise when a new user connects and have many friends
>>> that needs to be added. This can end up being quite an extensive amount of
>>> queries that has to happen and could take some time to do:
>>>  - Lookup the friends user id based on the social credentials
>>>  - Lookup the friends highscore
>>>  - Lookup the users own highscore
>>>  - Add friend with his highscore to self
>>>  - Add self to with own highscore to friend
>>>  - Do update to self with lastUpdatedFriends timestamp
>>>  - Do update to friend with lastUpdatedFriends timestamp
>>> Now if a new user has a bunch of friends this could end up being quite a
>>> lot of queries - all suffering from the read-before-write problem. Should
>>> any of the friends set a new highscore between the lookup and the writes
>>> the highscore would never be set correctly and duplicates would happen.
>>> I'm open to any suggestions ranging from how to model this differently
>>> to avoid the read-before-write to how to do this without risking having
>>> duplicate data that would be extremely painful to try and find again in
>>> highscore lists?
>>> Thanks,
>>> Kasper
>> --
>> Steve Robenalt
>> Software Architect
>>  HighWire | Stanford University
>> 425 Broadway St, Redwood City, CA 94063

Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

View raw message