incubator-cassandra-user mailing list archives

From Sylvain Lebresne <sylv...@yakaz.com>
Subject Re: question about deleting from cassandra
Date Thu, 18 Mar 2010 09:39:59 GMT
Hi,

I modified the patch to work against the current 0.6 svn branch (as I
needed it myself). I attached the files to jira if someone wants to play
with it. Maybe I should remove the old files, as they only worked
against an old random svn trunk?

--
Sylvain

On Mon, Mar 15, 2010 at 6:01 PM, Weijun Li <weijunli@gmail.com> wrote:
> OK I will try to separate them out.
>
> On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>
>> You should submit your minor change to jira for others who might want to
>> try it.
>>
>> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weijunli@gmail.com> wrote:
>> > Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it worked
>> > perfectly. Without this feature, as long as you have a high volume of new
>> > and expired columns your life will be miserable :-)
>> >
>> > Thanks for the great job Sylvain!!
>> >
>> > -Weijun
>> >
>> > On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylvain@yakaz.com>
>> > wrote:
>> >>
>> >> I guess you can also vote for this ticket :
>> >> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>> >>
>> >> </advertising>
>> >>
>> >> --
>> >> Sylvain
>> >>
>> >>
>> >> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
>> >> > On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
>> >> >>
>> >> >> Let's take Twitter as an example. All the tweets are timestamped. I
>> >> >> want to keep only a month's worth of tweets for each user. The number
>> >> >> of tweets that fit within this one-month window varies from user to
>> >> >> user. What is the best way to accomplish this?
>> >> >
>> >> > This is the "expiry" problem that has been discussed on this list
>> >> > before. As far as I can see, there is no easy way to do it with 0.5.
>> >> >
>> >> > If you use the ordered partitioner and make the first part of the
>> >> > keys a timestamp (or part of it) then you can get the keys and
>> >> > delete them.
>> >> >
>> >> > However, these deletes will be quite inefficient: currently each row
>> >> > must be deleted individually (there was a patch for range delete
>> >> > kicking around; I don't know if it has been accepted yet).
>> >> >
>> >> > But even if range delete is implemented, it's still quite inefficient
>> >> > and not really what you want, and it doesn't work with the
>> >> > RandomPartitioner.
>> >> >
>> >> > If you have some metadata to say who tweeted within a given period
>> >> > (say 10 days or 30 days) and you store all of a user's tweets for a
>> >> > period under the same key (say with one column per tweet, or using
>> >> > supercolumns), then you can just delete one key per user per period.
>> >> >
>> >> > One of the problems with using a time-based key with the ordered
>> >> > partitioner is that you're always going to have a data imbalance, so
>> >> > you may want to try hashing *part* of the key (the first part) so you
>> >> > can still range scan the next part. This may fix load balancing while
>> >> > still enabling you to use range scans to do data expiry.
>> >> >
>> >> > e.g. your key is
>> >> >
>> >> > Hash of day number + user id + timestamp
>> >> >
>> >> > Then you can range scan the entire day's tweets to expire them, and
>> >> > range scan a given user's tweets for a given day efficiently (and
>> >> > doing this for 30 days is just 30 range scans).
>> >> >
>> >> > Putting a hash in there fixes load balancing with OPP.
>> >> >
>> >> > Mark
>> >> >
>> >
>> >
>
>
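
For anyone who wants to experiment with the key layout Mark describes above
(hash of day number + user id + timestamp), here is a rough sketch in Python.
Everything in it is an assumption made for illustration: the function names,
the ':' separator (user ids must not contain it), MD5 as the hash, the day
number appended after the hash so two days can never share a prefix, and a
plain sorted list standing in for an order-preserving partitioner. It is not
Cassandra client code.

```python
import bisect
import hashlib

SECONDS_PER_DAY = 86400


def day_number(ts):
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day):
    """Hashed day prefix: spreads consecutive days around the key space.

    The zero-padded day number is appended after the hash (an assumption
    of this sketch) so that distinct days always get distinct prefixes.
    """
    digest = hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]
    return f"{digest}{day:06d}"


def tweet_key(user_id, ts):
    """Row key: hashed day prefix, then user id, then zero-padded timestamp."""
    return f"{day_prefix(day_number(ts))}:{user_id}:{ts:010d}"


def day_range(day):
    """Half-open [start, end) key range covering every tweet of one day."""
    prefix = day_prefix(day)
    # ';' is the ASCII character right after ':', so it bounds the prefix.
    return prefix + ":", prefix + ";"


def user_day_range(user_id, day):
    """Half-open [start, end) key range for one user's tweets on one day."""
    prefix = f"{day_prefix(day)}:{user_id}"
    return prefix + ":", prefix + ";"


def keys_in_range(sorted_keys, start, end):
    """Stand-in for a range scan over keys kept in sorted order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]
```

Expiring a day would then be one scan over day_range(day) followed by a
delete of each returned key, and reading a user's last 30 days would be 30
calls with user_day_range, one per day, as Mark says.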
