incubator-cassandra-user mailing list archives

From Ryan Daum <r...@thimbleware.com>
Subject Re: question about deleting from cassandra
Date Sun, 14 Mar 2010 14:29:33 GMT
+1, I'd like to try this patch but am running into error: patch failed:
src/java/org/apache/cassandra/utils/FBUtilities.java:342

Alternatively, someone could create a github fork which incorporates this
patch?

Ryan

On Sat, Mar 13, 2010 at 3:36 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> since they are separate changes, it's much easier to review if they
> are submitted separately.
>
> On 3/13/10, Weijun Li <weijunli@gmail.com> wrote:
> > Sure. I'm making another change for replication across multiple DCs; once
> > this one is done (probably next week) I'll submit them together to Jira.
> > All based on 0.6 beta2.
> >
> > -Weijun
> >
> > -----Original Message-----
> > From: Jonathan Ellis [mailto:jbellis@gmail.com]
> > Sent: Saturday, March 13, 2010 5:36 AM
> > To: cassandra-user@incubator.apache.org
> > Subject: Re: question about deleting from cassandra
> >
> > You should submit your minor change to jira for others who might want to
> > try it.
> >
> > On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weijunli@gmail.com> wrote:
> >> Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it worked
> >> perfectly. Without this feature, as long as you have a high volume of new
> >> and expired columns your life will be miserable :-)
> >>
> >> Thanks for the great job, Sylvain!!
> >>
> >> -Weijun
> >>
> >> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylvain@yakaz.com>
> >> wrote:
> >>>
> >>> I guess you can also vote for this ticket :
> >>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
> >>>
> >>> </advertising>
> >>>
> >>> --
> >>> Sylvain
> >>>
> >>>
> >>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
> >>> > On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
> >>> >>
> >>> >> Let's take Twitter as an example.  All the tweets are timestamped.
> >>> >> I want to keep only a month's worth of tweets for each user.  The
> >>> >> number of tweets that fit within this one-month window varies from
> >>> >> user to user.  What is the best way to accomplish this?
> >>> >
> >>> > This is the "expiry" problem that has been discussed on this list
> >>> > before. As far as I can see there are no easy ways to do it with 0.5.
> >>> >
> >>> > If you use the ordered partitioner and make the first part of the
> >>> > keys a timestamp (or part of it) then you can get the keys and delete
> >>> > them.
> >>> >
> >>> > However, these deletes will be quite inefficient; currently each row
> >>> > must be deleted individually (there was a patch to range delete
> >>> > kicking around, I don't know if it's accepted yet).
> >>> >
> >>> > But even if range delete is implemented, it's still quite inefficient
> >>> > and not really what you want, and it doesn't work with the
> >>> > RandomPartitioner.
> >>> >
> >>> > If you have some metadata to say who tweeted within a given period
> >>> > (say 10 days or 30 days) and you store the tweets all in the same key
> >>> > per user per period (say with one column per tweet, or use
> >>> > supercolumns), then you can just delete one key per user per period.
> >>> >
> >>> > One of the problems with using a time-based key with ordered
> >>> > partitioner is that you're always going to have a data imbalance, so
> >>> > you may want to try hashing *part* of the key (the first part) so you
> >>> > can still range scan the next part. This may fix load balancing while
> >>> > still enabling you to use range scans to do data expiry.
> >>> >
> >>> > e.g. your key is
> >>> >
> >>> > Hash of day number + user id + timestamp
> >>> >
> >>> > Then you can range scan the entire day's tweets to expire them, and
> >>> > range scan a given user's tweets for a given day efficiently (and
> >>> > doing this for 30 days is just 30 range scans).
> >>> >
> >>> > Putting a hash in there fixes load balancing with OPP.
> >>> >
> >>> > Mark
> >>> >
> >>
> >>
> >
> >
>
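
Mark's key layout quoted above (hash of day number + user id + timestamp)
can be sketched roughly as follows. This is only an illustration in Python,
not code from the thread or from Cassandra. The function names, the ':'
separator, the choice of MD5, and the sorted list standing in for an
order-preserving partitioner are all assumptions, and user ids are assumed
to be plain ASCII alphanumerics.

```python
import bisect
import hashlib

SECONDS_PER_DAY = 86400


def day_number(ts: int) -> int:
    """Days since the Unix epoch for a timestamp given in seconds (UTC)."""
    return ts // SECONDS_PER_DAY


def day_prefix(day: int) -> str:
    """Hash of the day number, so consecutive days scatter across the ring.

    MD5 is an arbitrary choice here; any stable hash would do.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()


def tweet_key(user_id: str, ts: int) -> str:
    """Row key: hash(day) + user id + zero-padded timestamp."""
    return f"{day_prefix(day_number(ts))}:{user_id}:{ts:010d}"


def user_day_range(user_id: str, day: int) -> tuple[str, str]:
    """Half-open key range [start, end) covering one user's tweets for one day."""
    prefix = f"{day_prefix(day)}:{user_id}:"
    # '~' sorts after every digit, so it closes the range of timestamps.
    return prefix, prefix + "~"


def day_range(day: int) -> tuple[str, str]:
    """Half-open key range [start, end) covering every tweet of one day."""
    prefix = f"{day_prefix(day)}:"
    return prefix, prefix + "~"


def range_scan(sorted_keys: list[str], start: str, end: str) -> list[str]:
    """Stand-in for a range scan over keys kept in sorted order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


def expired_days(now_ts: int, oldest_day: int, retention_days: int = 30) -> range:
    """Day numbers that have fallen out of the retention window."""
    return range(oldest_day, day_number(now_ts) - retention_days + 1)
```

Expiring old data then means one range_scan (or range delete, if available)
per value returned by expired_days, and reading a user's last 30 days means
30 calls with user_day_range. Because only the day number is hashed, keys
for the same day stay contiguous while different days land in unrelated
parts of the key space.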
