incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: question about deleting from cassandra
Date Sat, 13 Mar 2010 13:35:56 GMT
You should submit your minor change to jira for others who might want to try it.

On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weijunli@gmail.com> wrote:
> Tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and it worked
> perfectly. Without this feature, as long as you have a high volume of new and
> expired columns, your life will be miserable :-)
>
> Thanks for the great job, Sylvain!!
>
> -Weijun
>
> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylvain@yakaz.com> wrote:
>>
>> I guess you can also vote for this ticket :
>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>>
>> </advertising>
>>
>> --
>> Sylvain
>>
>>
>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
>> > On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
>> >>
>> >> Let's take Twitter as an example. All the tweets are timestamped. I want
>> >> to keep only a month's worth of tweets for each user. The number of tweets
>> >> that fit within this one-month window varies from user to user. What is the
>> >> best way to accomplish this?
>> >
>> > This is the "expiry" problem that has been discussed on this list before.
>> > As far as I can see, there are no easy ways to do it with 0.5.
>> >
>> > If you use the ordered partitioner and make the first part of the keys a
>> > timestamp (or part of it) then you can get the keys and delete them.
>> >
>> > However, these deletes will be quite inefficient: currently each row must
>> > be deleted individually (there was a range-delete patch kicking around; I
>> > don't know if it has been accepted yet).
>> >
>> > But even if range delete is implemented, it's still quite inefficient and
>> > not really what you want, and it doesn't work with the RandomPartitioner.
>> >
>> > If you have some metadata to say who tweeted within a given period (say 10
>> > days or 30 days) and you store the tweets all in the same key per user per
>> > period (say with one column per tweet, or use supercolumns), then you can
>> > just delete one key per user per period.
>> >
>> > One of the problems with using a time-based key with the ordered
>> > partitioner is that you're always going to have a data imbalance, so you
>> > may want to try hashing *part* of the key (the first part) so you can still
>> > range scan the next part. This may fix load balancing while still enabling
>> > you to use range scans to do data expiry.
>> >
>> > e.g. your key is
>> >
>> > Hash of day number + user id + timestamp
>> >
>> > Then you can range scan the entire day's tweets to expire them, and range
>> > scan a given user's tweets for a given day efficiently (and doing this for
>> > 30 days is just 30 range scans).
>> >
>> > Putting a hash in there fixes load balancing with OPP.
>> >
>> > Mark
>> >
>
>
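
For anyone who wants to try the key layout Mark describes (hash of day number + user id + timestamp), here is a minimal sketch in Python. It does not call any Cassandra client API: it only builds the key strings and simulates an order-preserving key space with a sorted list. The ':' delimiter, the truncated MD5 prefix, the zero-padding width, and all function names are assumptions made for illustration, not something taken from the thread.

```python
import bisect
import hashlib

SECONDS_PER_DAY = 86400
SEP = ":"        # assumed delimiter; user ids must not contain it
SEP_NEXT = ";"   # the ASCII character right after ':', used to close a prefix range


def day_number(ts):
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day):
    """Hash only the day number, so consecutive days scatter across the key space.

    Truncating to 8 hex characters is an assumption; a collision between two
    days would mix their keys, so a longer prefix may be preferable.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]


def tweet_key(user_id, ts):
    """Key layout: hash(day) + user id + zero-padded timestamp."""
    return f"{day_prefix(day_number(ts))}{SEP}{user_id}{SEP}{ts:012d}"


def day_range(day):
    """[start, end) bounds covering every key written on the given day."""
    p = day_prefix(day)
    return p + SEP, p + SEP_NEXT


def user_day_range(user_id, day):
    """[start, end) bounds covering one user's keys on the given day."""
    p = day_prefix(day) + SEP + user_id
    return p + SEP, p + SEP_NEXT


def range_scan(sorted_keys, start, end):
    """Stand-in for a range scan over an ordered key space (a sorted list here)."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


if __name__ == "__main__":
    day = 20000
    base = day * SECONDS_PER_DAY
    keys = sorted([
        tweet_key("alice", base + 5),
        tweet_key("bob", base + 100),
        tweet_key("alice", base + SECONDS_PER_DAY + 7),  # next day
    ])
    # Keys an expiry job would delete for this day:
    print(range_scan(keys, *day_range(day)))
    # One user's tweets for this day:
    print(range_scan(keys, *user_day_range("alice", day)))
```

Because only the day number is hashed, all keys for one day stay contiguous (so one range scan finds them for expiry), while different days land at unrelated points in the key space. Reading 30 days of one user's tweets is then 30 calls to `user_day_range`, as described above.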
