incubator-cassandra-user mailing list archives

From Weijun Li <weiju...@gmail.com>
Subject Re: question about deleting from cassandra
Date Mon, 15 Mar 2010 17:01:09 GMT
OK I will try to separate them out.

On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jbellis@gmail.com> wrote:

> You should submit your minor change to jira for others who might want to
> try it.
>
> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weijunli@gmail.com> wrote:
> > Tried Sylvain's feature in 0.6 beta2 (it needs a minor change) and it worked
> > perfectly. Without this feature, as long as you have a high volume of new and
> > expired columns, your life will be miserable :-)
> >
> > Thanks for the great job, Sylvain!!
> >
> > -Weijun
> >
> > On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylvain@yakaz.com>
> > wrote:
> >>
> >> I guess you can also vote for this ticket :
> >> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
> >>
> >> </advertising>
> >>
> >> --
> >> Sylvain
> >>
> >>
> >> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
> >> > On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
> >> >>
> >> >> Let's take Twitter as an example.  All the tweets are timestamped.  I want
> >> >> to keep only a month's worth of tweets for each user.  The number of tweets
> >> >> that fit within this one-month window varies from user to user.  What is the
> >> >> best way to accomplish this?
> >> >
> >> > This is the "expiry" problem that has been discussed on this list before. As
> >> > far as I can see, there are no easy ways to do it with 0.5.
> >> >
> >> > If you use the ordered partitioner and make the first part of the keys a
> >> > timestamp (or part of it) then you can get the keys and delete them.
> >> >
> >> > However, these deletes will be quite inefficient: currently each row must be
> >> > deleted individually (there was a patch for range delete kicking around; I
> >> > don't know if it's been accepted yet).
> >> >
> >> > But even if range delete is implemented, it's still quite inefficient and
> >> > not really what you want, and it doesn't work with the RandomPartitioner.
> >> >
> >> > If you have some metadata to say who tweeted within a given period (say 10
> >> > days or 30 days) and you store the tweets all in the same key per user per
> >> > period (say with one column per tweet, or use supercolumns), then you can
> >> > just delete one key per user per period.
> >> >
> >> > One of the problems with using a time-based key with the ordered partitioner is
> >> > that you're always going to have a data imbalance, so you may want to try
> >> > hashing *part* of the key (the first part) so you can still range scan the
> >> > next part. This may fix load balancing while still enabling you to use range
> >> > scans to do data expiry.
> >> >
> >> > e.g. your key is
> >> >
> >> > Hash of day number + user id + timestamp
> >> >
> >> > Then you can range scan the entire day's tweets to expire them, and range
> >> > scan a given user's tweets for a given day efficiently (and doing this for
> >> > 30 days is just 30 range scans).
> >> >
> >> > Putting a hash in there fixes load balancing with OPP.
> >> >
> >> > Mark
> >> >
> >
> >
>
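
For reference, the key layout Mark describes in the quoted thread (hash of day number + user id + timestamp) could be sketched roughly as below. This is only an illustration under stated assumptions: the MD5 prefix length, the ":" separator, the zero-padded timestamp and every function name are hypothetical choices, not anything from Cassandra or from the thread. It uses a sorted Python list to stand in for an order-preserving key space and calls no Cassandra API.

```python
import hashlib
from bisect import bisect_left

SECONDS_PER_DAY = 86400


def day_number(ts: int) -> int:
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day: int) -> str:
    """Short hash of the day number, so consecutive days land far apart.

    Assumption: 8 hex characters of MD5; distinct days could in principle
    collide on this prefix, which this sketch does not handle.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]


def tweet_key(user_id: str, ts: int) -> str:
    """Key = hash(day) + user id + timestamp (assumes no ':' in user ids)."""
    # Zero-pad the timestamp so string order matches numeric order.
    return f"{day_prefix(day_number(ts))}:{user_id}:{ts:010d}"


def day_range(day: int) -> tuple[str, str]:
    """Half-open [start, end) bounds covering every key for one day."""
    prefix = day_prefix(day)
    # ';' is the ASCII character right after ':', so it closes the range.
    return prefix + ":", prefix + ";"


def user_day_range(user_id: str, day: int) -> tuple[str, str]:
    """Half-open [start, end) bounds for one user's keys on one day."""
    prefix = f"{day_prefix(day)}:{user_id}"
    return prefix + ":", prefix + ";"


def range_scan(sorted_keys: list[str], start: str, end: str) -> list[str]:
    """Stand-in for a range scan over an order-preserving key space."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


# Three tweets on day 14680 and one on day 14681.
keys = sorted(
    tweet_key(user, ts)
    for user, ts in [
        ("alice", 1268352000),
        ("alice", 1268355600),
        ("bob", 1268352100),
        ("alice", 1268438400),
    ]
)

expired = range_scan(keys, *day_range(14680))            # whole day, to delete
alice_day = range_scan(keys, *user_day_range("alice", 14680))
```

Reading a user's last 30 days is then 30 calls to `user_day_range`, one per day, and expiring a day is one scan over `day_range` followed by the (still per-row) deletes.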
