incubator-cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@yakaz.com>
Subject Re: question about deleting from cassandra
Date Fri, 12 Mar 2010 08:27:33 GMT
I guess you can also vote for this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-699 :)

</advertising>

--
Sylvain


On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
> On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
>>
>> Let's take Twitter as an example.  All the tweets are timestamped.  I want
>> to keep only a month's worth of tweets for each user.  The number of tweets
>> that fit within this one month window varies from user to user.  What is the
>> best way to accomplish this?
>
> This is the "expiry" problem that has been discussed on this list before. As
> far as I can see, there is no easy way to do it with 0.5.
>
> If you use the ordered partitioner and make the first part of the keys a
> timestamp (or part of it) then you can get the keys and delete them.
>
> However, these deletes will be quite inefficient: currently each row must be
> deleted individually. (There was a patch for range delete kicking around; I
> don't know if it has been accepted yet.)
>
> But even if range delete is implemented, it's still quite inefficient and
> not really what you want, and it doesn't work with the RandomPartitioner.
>
> If you have some metadata to say who tweeted within a given period (say 10
> days or 30 days) and you store the tweets all in the same key per user per
> period (say with one column per tweet, or use supercolumns), then you can
> just delete one key per user per period.
>
> One of the problems with using a time-based key with the ordered partitioner
> is that you're always going to have a data imbalance, so you may want to try
> hashing *part* of the key (the first part) so you can still range scan the
> next part. This may fix load balancing while still enabling you to use range
> scans to do data expiry.
>
> e.g. your key is
>
> Hash of day number + user id + timestamp
>
> Then you can range scan the entire day's tweets to expire them, and range
> scan a given user's tweets for a given day efficiently (doing this for
> 30 days is just 30 range scans).
>
> Putting a hash in there fixes load balancing with OPP.
>
> Mark
>
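
The two key layouts Mark describes above can be sketched in plain Python, independent of any Cassandra client. This is only an illustration under stated assumptions: the `:` separator, the 8-character MD5 prefix, the 10-digit zero-padded timestamp, and every function name here are hypothetical choices, not anything from the thread. It also assumes keys sort in plain code-point order and that user ids never contain `:`.

```python
import hashlib
from typing import Tuple

SECONDS_PER_DAY = 86400


def day_number(epoch_seconds: int) -> int:
    """Whole days since the Unix epoch (UTC)."""
    return epoch_seconds // SECONDS_PER_DAY


def bucket_key(user_id: str, epoch_seconds: int, period_days: int = 30) -> str:
    """Layout 1: one row per user per period, one column per tweet.

    Expiring a period for a user is then a single row delete.
    """
    return f"{user_id}:{day_number(epoch_seconds) // period_days}"


def day_prefix(day: int) -> str:
    """Short hash of the day number (assumed 8 hex chars of MD5).

    Consecutive days get unrelated prefixes, so they land in different
    parts of an order-preserving ring instead of piling onto one node.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]


def tweet_key(user_id: str, epoch_seconds: int) -> str:
    """Layout 2: hash of day number + user id + timestamp.

    The timestamp is zero-padded so string order matches time order.
    """
    prefix = day_prefix(day_number(epoch_seconds))
    return f"{prefix}:{user_id}:{epoch_seconds:010d}"


def day_range(day: int) -> Tuple[str, str]:
    """[start, end) bounds covering every key written on one day."""
    prefix = day_prefix(day)
    # ';' is the character right after ':' in ASCII, so it closes the range.
    return prefix + ":", prefix + ";"


def user_day_range(user_id: str, day: int) -> Tuple[str, str]:
    """[start, end) bounds covering one user's keys for one day."""
    prefix = f"{day_prefix(day)}:{user_id}"
    return prefix + ":", prefix + ";"
```

Expiring a day then means scanning `day_range(day)` and deleting what comes back, and reading a user's last 30 days means 30 calls to `user_day_range`. Note that an 8-character prefix can in principle collide between two days; a longer prefix, or appending the raw day number after the hash, would avoid that.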
