incubator-cassandra-user mailing list archives

From Mike Gallamore <mike.e.gallam...@googlemail.com>
Subject Re: effective modeling for fixed limit columns
Date Fri, 16 Apr 2010 20:06:48 GMT
The problem I'm working on is very similar to this. I'm working on a 
reputation system, and we keep a fixed number of day buckets for the 
scores. When new data comes in you need to find out which bucket is 
supposed to be used, remove the data in it if you've moved to a new 
bucket (the data there would be at least n + 1 days old, where n is the 
number of days you are keeping), and then store the value. In our case 
we are willing to accept that we might occasionally lose a bit of data, 
as the data tends to trend towards a "good enough" value quite quickly. 
Still, it would be cool to know a way to be sure that it is really safe 
to "nuke" a bucket, short of always insisting on blocking writes on an 
"ALL" read. That can be very painful if you are replicating data across 
datacentres.
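
In rough Python, the bucket bookkeeping amounts to something like this 
(an in-memory sketch only; the bucket count and helper names are made 
up, and a real version would read and write Cassandra columns instead 
of a dict):

import time

N_DAYS = 7  # number of day buckets kept (example value)

def record_score(buckets, score, now=None):
    """buckets maps slot -> (absolute_day, running_total)."""
    day = int((time.time() if now is None else now) // 86400)
    slot = day % N_DAYS                  # the bucket today's data lands in
    stored_day, total = buckets.get(slot, (day, 0.0))
    if stored_day != day:
        total = 0.0                      # stale: held data >= n + 1 days old
    buckets[slot] = (day, total + score)

buckets = {}
record_score(buckets, 4.2)
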
On 04/16/2010 11:50 AM, Chris Shorrock wrote:
> I'm attempting to come up with a technique for limiting the number of 
> columns a single key (or super column - doesn't matter too much for 
> the context of this conversation) may contain at any one time. My 
> actual use-case is a little too meaty to try to describe so an 
> alternate use-case of this mechanism could be:
>
>     /Construct a twitter-esque feed which maintains a list of N
>     tweets. Tweets (in this system - and in reality, I suppose) occur
>     at such a rate that you want to limit a given user's "feed" to N
>     items. You do not have the ability to store an infinite number of
>     tweets due to the physical constraints of your hardware./
>
>
> The "/my first idea/" answer is when a tweet is inserted into the the 
> feed of a given person, that you then do a count and delete of any 
> outstanding tweets. In reality you could first count, then (if count 
> >= N) do a batch mutate for the insertion of the new entry and the 
> removal of the old. My issue with this approach is that after a 
> certain point every new entry into the system will incur the removal 
> of an old entry. The count, once a feed has reached N will always be 
> >= N on any subsequent queries. Depending on how you index the tweets 
> you may need to actually do a read instead of count to get the row 
> identifiers.
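
For illustration, that count-then-trim flow looks roughly like this in 
Python with the pycassa client (the keyspace and column family names, 
the LongType timestamp comparator, and the feed size are all assumed 
for the sketch):

import time
import pycassa

pool = pycassa.ConnectionPool('Feeds')         # hypothetical keyspace
feed = pycassa.ColumnFamily(pool, 'UserFeed')  # comparator assumed LongType

N = 200  # fixed feed length (example value)

def append_tweet(user_key, tweet_id):
    # Column names are millisecond timestamps, so columns sort oldest-first.
    feed.insert(user_key, {int(time.time() * 1000): tweet_id})
    # Once the feed has ever reached N this count stays >= N, so every
    # subsequent insert pays for a count, a read, and a delete.
    count = feed.get_count(user_key)
    if count > N:
        oldest = feed.get(user_key, column_count=count - N)
        feed.remove(user_key, columns=list(oldest.keys()))
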
>
> My second approach was to utilize a "slot" system, where you have a 
> record stored somewhere that indicates the next slot for insertion. 
> This can be thought of as a fixed-length array where you store the 
> next insertion point in some other column family. When a new tweet 
> occurs you retrieve the current "slot" meta-data, insert into that 
> index, then update the meta-data for the next insertion. My concerns 
> with this revolve around synchronization and losing entries due to 
> concurrent operations; I'd rather not have to use something like 
> ZooKeeper to synchronize in the application cluster.
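
A bare-bones sketch of that slot scheme (same assumed pycassa setup, 
with made-up column family names); the unguarded read-modify-write on 
the pointer is exactly where concurrent writers can lose an entry:

import pycassa

pool = pycassa.ConnectionPool('Feeds')               # hypothetical keyspace
feed = pycassa.ColumnFamily(pool, 'UserFeedSlots')   # comparator assumed LongType
meta = pycassa.ColumnFamily(pool, 'FeedMeta')        # holds the next-slot pointer

N = 200  # fixed number of slots (example value)

def append_tweet(user_key, tweet_id):
    try:
        slot = int(meta.get(user_key)['next_slot'])
    except pycassa.NotFoundException:
        slot = 0
    # Overwriting this slot implicitly evicts the oldest entry.
    feed.insert(user_key, {slot: tweet_id})
    # NOT atomic with the read above: two concurrent writers can read
    # the same slot, and one tweet silently disappears.
    meta.insert(user_key, {'next_slot': str((slot + 1) % N)})
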
>
> I have some other ideas, but I'm mostly just spit-balling at this 
> point, so I thought I'd reach out to the collective intelligence of 
> the group to see if anyone has implemented something similar. Thanks 
> in advance.

