incubator-cassandra-user mailing list archives

From Weijun Li <weiju...@gmail.com>
Subject Re: question about deleting from cassandra
Date Sat, 13 Mar 2010 09:18:26 GMT
Tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and it worked
perfectly. Without this feature, as long as you have a high volume of new and
expired columns, your life will be miserable :-)

Thanks for the great job, Sylvain!!

-Weijun

On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylvain@yakaz.com> wrote:

> I guess you can also vote for this ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>
> </advertising>
>
> --
> Sylvain
>
>
> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
> > On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
> >>
> >> Let's take Twitter as an example.  All the tweets are timestamped.  I want
> >> to keep only a month's worth of tweets for each user.  The number of tweets
> >> that fit within this one-month window varies from user to user.  What is the
> >> best way to accomplish this?
> >
> > This is the "expiry" problem that has been discussed on this list before. As
> > far as I can see, there are no easy ways to do it with 0.5.
> >
> > If you use the ordered partitioner and make the first part of the keys a
> > timestamp (or part of it) then you can get the keys and delete them.
> >
> > However, these deletes will be quite inefficient: currently each row must be
> > deleted individually (there was a patch for range delete kicking around; I
> > don't know if it's been accepted yet).
> >
> > But even if range delete is implemented, it's still quite inefficient and
> > not really what you want, and it doesn't work with the RandomPartitioner.
> >
> > If you have some metadata to say who tweeted within a given period (say 10
> > days or 30 days) and you store all the tweets under one key per user per
> > period (say with one column per tweet, or use supercolumns), then you can
> > just delete one key per user per period.
> >
> > One of the problems with using a time-based key with the ordered partitioner is
> > that you're always going to have a data imbalance, so you may want to try
> > hashing *part* of the key (the first part) so you can still range scan the
> > next part. This may fix load balancing while still enabling you to use range
> > scans to do data expiry.
> >
> > e.g. your key is
> >
> > Hash of day number + user id + timestamp
> >
> > Then you can range scan the entire day's tweets to expire them, and range
> > scan a given user's tweets for a given day efficiently (and doing this for
> > 30 days is just 30 range scans).
> >
> > Putting a hash in there fixes load balancing with OPP.
> >
> > Mark
> >
>

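For anyone who wants to try the key layout Mark describes (hash of day number + user id + timestamp), here is a rough Python sketch of how such keys and their range-scan bounds could be built. It only constructs strings and does not call any Cassandra client API. The ":" separator, the use of MD5, the 16-character prefix, the zero-padded timestamp, and all function names are my own assumptions for illustration, not something from the thread. It also assumes user ids never contain ":".

```python
import hashlib
from typing import Tuple

SECONDS_PER_DAY = 86400


def day_number(ts: int) -> int:
    """Day number since the Unix epoch for a timestamp in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day: int) -> str:
    """Fixed-width hash of the day number (assumption: MD5, first 16 hex chars).

    Hashing scatters consecutive days across the key space, which is the
    load-balancing idea for the order-preserving partitioner.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:16]


def tweet_key(user_id: str, ts: int) -> str:
    """Row key: hash(day) + user id + timestamp, joined with ':' (assumption).

    The timestamp is zero-padded so string order matches numeric order.
    """
    return f"{day_prefix(day_number(ts))}:{user_id}:{ts:010d}"


def day_range(day: int) -> Tuple[str, str]:
    """Half-open [start, end) key bounds covering every tweet of one day.

    ';' is the ASCII character right after ':', so it bounds the prefix.
    """
    prefix = day_prefix(day)
    return prefix + ":", prefix + ";"


def user_day_range(user_id: str, day: int) -> Tuple[str, str]:
    """Half-open [start, end) key bounds for one user's tweets on one day."""
    prefix = f"{day_prefix(day)}:{user_id}"
    return prefix + ":", prefix + ";"
```

Expiring a day would then be one range scan over day_range(day) followed by deleting the returned keys, and reading a user's last 30 days would be 30 scans over user_day_range(user, day). A truncated hash can in principle collide between two days, so a real deployment might keep the full digest.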