incubator-cassandra-user mailing list archives

From "Weijun Li" <weiju...@gmail.com>
Subject RE: question about deleting from cassandra
Date Sat, 13 Mar 2010 18:16:20 GMT
Sure. I'm making another change for replication across multiple DCs; once this
one is done (probably next week) I'll submit them together to Jira. Both are
based on 0.6 beta2.

-Weijun

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Saturday, March 13, 2010 5:36 AM
To: cassandra-user@incubator.apache.org
Subject: Re: question about deleting from cassandra

You should submit your minor change to Jira for others who might want to
try it.

On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weijunli@gmail.com> wrote:
> Tried Sylvain's feature in 0.6 beta2 (needed a minor change) and it worked
> perfectly. Without this feature, if you have a high volume of new and
> expired columns your life will be miserable :-)
>
> Thanks for the great job, Sylvain!!
>
> -Weijun
>
> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylvain@yakaz.com>
> wrote:
>>
>> I guess you can also vote for this ticket:
>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>>
>> </advertising>
>>
>> --
>> Sylvain
>>
>>
>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
>> > On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
>> >>
>> >> Let's take Twitter as an example. All the tweets are timestamped. I want
>> >> to keep only a month's worth of tweets for each user. The number of
>> >> tweets that fit within this one-month window varies from user to user.
>> >> What is the best way to accomplish this?
>> >
>> > This is the "expiry" problem that has been discussed on this list before.
>> > As far as I can see there is no easy way to do it with 0.5.
>> >
>> > If you use the ordered partitioner and make the first part of the key a
>> > timestamp (or part of it), then you can get the keys and delete them.
>> >
>> > However, these deletes will be quite inefficient: currently each row must
>> > be deleted individually (there was a patch for range deletes kicking
>> > around; I don't know if it has been accepted yet).
>> >
>> > But even if range delete is implemented, it's still quite inefficient and
>> > not really what you want, and it doesn't work with the RandomPartitioner.
>> >
>> > If you have some metadata to say who tweeted within a given period (say,
>> > 10 days or 30 days) and you store the tweets all in the same key per user
>> > per period (say, with one column per tweet, or using supercolumns), then
>> > you can just delete one key per user per period.
>> >
>> > One of the problems with using a time-based key with the ordered
>> > partitioner is that you're always going to have a data imbalance, so you
>> > may want to try hashing *part* of the key (the first part) so you can
>> > still range scan the next part. This may fix load balancing while still
>> > enabling you to use range scans to do data expiry.
>> >
>> > e.g. your key is:
>> >
>> > Hash of day number + user id + timestamp
>> >
>> > Then you can range scan the entire day's tweets to expire them, and
>> > range scan a given user's tweets for a given day efficiently (and doing
>> > this for 30 days is just 30 range scans).
>> >
>> > Putting a hash in there fixes load balancing with OPP.
>> >
>> > Mark
>> >
>
>
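
For readers who want to see the key layout Mark describes above (hash of day
number + user id + timestamp) spelled out, here is a minimal, self-contained
sketch. Everything in it is illustrative: the class and method names
(ExpiryKeys, dayNumber, dayHash, makeKey, RowStore, expireDay) are invented
for the example, it assumes string row keys, and it uses MD5 only because it
is handy in the JDK; none of this is Cassandra API.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch of the key layout: hash(day number) + user id + timestamp.
    // The hash prefix spreads rows across the OPP token range while the
    // rest of the key stays range-scannable within a day.
    public class ExpiryKeys {

        // Day number since the Unix epoch, used as the expiry bucket.
        static long dayNumber(long timestampMillis) {
            return timestampMillis / (24L * 60 * 60 * 1000);
        }

        // Short hex hash of the day number; prefixing keys with this
        // balances load under the order-preserving partitioner.
        static String dayHash(long day) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(Long.toString(day).getBytes());
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {          // 4 bytes is plenty of spread
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        }

        // Row key: hash-of-day + user id + timestamp. A range scan on the
        // hash prefix returns one whole day's tweets; a narrower scan on
        // hash + user returns one user's tweets for that day.
        static String makeKey(String userId, long timestampMillis)
                throws NoSuchAlgorithmException {
            return dayHash(dayNumber(timestampMillis))
                    + ":" + userId + ":" + timestampMillis;
        }

        // Hypothetical storage interface, declared here only to make the
        // expiry loop concrete; real code would go through the Thrift API
        // (get_range_slices / remove) or a client library.
        interface RowStore {
            Iterable<String> keysWithPrefix(String prefix) throws Exception;
            void remove(String key) throws Exception;
        }

        // Expire one whole day: range scan that day's hash prefix and
        // delete each row individually, as described in the thread.
        static void expireDay(RowStore store, long day) throws Exception {
            for (String key : store.keysWithPrefix(dayHash(day))) {
                store.remove(key);
            }
        }

        public static void main(String[] args) throws Exception {
            long now = System.currentTimeMillis();
            System.out.println(makeKey("alice", now));
            // Keeping one month of data would then be one expireDay call per
            // day older than 30 days, i.e. one range scan per expired day.
        }
    }

Keeping a rolling month of data is then roughly 30 calls to something like
expireDay, one per day that has aged out, which matches the "30 range scans"
Mark mentions.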

