incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: question about deleting from cassandra
Date Sat, 13 Mar 2010 20:36:58 GMT
Since they are separate changes, it's much easier to review if they
are submitted separately.

On 3/13/10, Weijun Li <weijunli@gmail.com> wrote:
> Sure. I'm making another change for replication across multiple DCs; once this
> one is done (probably next week) I'll submit them together to Jira. All
> based on 0.6 beta2.
>
> -Weijun
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Saturday, March 13, 2010 5:36 AM
> To: cassandra-user@incubator.apache.org
> Subject: Re: question about deleting from cassandra
>
> You should submit your minor change to jira for others who might want to try
> it.
>
> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weijunli@gmail.com> wrote:
>> Tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and it worked
>> perfectly. Without this feature, as long as you have a high volume of new and
>> expired columns, your life will be miserable :-)
>>
>> Thanks for the great job, Sylvain!!
>>
>> -Weijun
>>
>> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylvain@yakaz.com>
>> wrote:
>>>
>>> I guess you can also vote for this ticket:
>>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>>>
>>> </advertising>
>>>
>>> --
>>> Sylvain
>>>
>>>
>>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <markxr@gmail.com> wrote:
>>> > On 12 March 2010 03:34, Bill Au <bill.w.au@gmail.com> wrote:
>>> >>
>>> >> Let's take Twitter as an example.  All the tweets are timestamped.  I want
>>> >> to keep only a month's worth of tweets for each user.  The number of tweets
>>> >> that fit within this one-month window varies from user to user.  What is the
>>> >> best way to accomplish this?
>>> >
>>> > This is the "expiry" problem that has been discussed on this list before.
>>> > As far as I can see, there are no easy ways to do it with 0.5.
>>> >
>>> > If you use the ordered partitioner and make the first part of the keys a
>>> > timestamp (or part of it), then you can get the keys and delete them.
>>> >
>>> > However, these deletes will be quite inefficient; currently each row must be
>>> > deleted individually (there was a patch for range delete kicking around, but
>>> > I don't know if it's been accepted yet).
>>> >
>>> > But even if range delete is implemented, it's still quite inefficient and
>>> > not really what you want, and it doesn't work with the RandomPartitioner.
>>> >
>>> > If you have some metadata to say who tweeted within a given period (say 10
>>> > days or 30 days) and you store the tweets all in the same key per user per
>>> > period (say with one column per tweet, or use supercolumns), then you can
>>> > just delete one key per user per period.
>>> >
>>> > One of the problems with using a time-based key with the ordered partitioner
>>> > is that you're always going to have a data imbalance, so you may want to try
>>> > hashing *part* of the key (the first part) so you can still range scan the
>>> > next part. This may fix load balancing while still enabling you to use range
>>> > scans to do data expiry.
>>> >
>>> > e.g. your key is
>>> >
>>> > Hash of day number + user id + timestamp
>>> >
>>> > Then you can range scan the entire day's tweets to expire them, and range
>>> > scan a given user's tweets for a given day efficiently (and doing this for
>>> > 30 days is just 30 range scans).
>>> >
>>> > Putting a hash in there fixes load balancing with OPP.
>>> >
>>> > Mark
>>> >
>>
>>
>
>
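
The key layout Mark describes above (hash of day number + user id + timestamp)
could be sketched roughly as follows. This is only an illustration under
assumptions that are not from the thread: the function names, the ':'
separator, MD5 as the hash, and the zero-padded timestamp are all hypothetical
choices. It only builds key strings and scan bounds; it does not call any
Cassandra API, and it assumes user ids never contain ':'.

```python
import hashlib

SECONDS_PER_DAY = 86400


def day_number(ts: int) -> int:
    """Days since the Unix epoch (UTC) for a POSIX timestamp in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day: int) -> str:
    """Hash only the day number so consecutive days land far apart in key order."""
    return hashlib.md5(str(day).encode("ascii")).hexdigest()


def tweet_key(user_id: str, ts: int) -> str:
    """Build 'hash(day):user:timestamp'; the timestamp is zero-padded so that
    string order matches numeric order within one user's day."""
    return f"{day_prefix(day_number(ts))}:{user_id}:{ts:010d}"


def day_range(day: int) -> tuple[str, str]:
    """Half-open [start, end) key bounds covering every tweet of one day.
    ';' is the ASCII character right after ':', so it closes the range."""
    prefix = day_prefix(day)
    return prefix + ":", prefix + ";"


def user_day_range(user_id: str, day: int) -> tuple[str, str]:
    """Half-open [start, end) key bounds for one user's tweets on one day."""
    base = f"{day_prefix(day)}:{user_id}"
    return base + ":", base + ";"
```

With an order-preserving partitioner, expiring a day would then be one range
scan over day_range(day), and reading a user's last 30 days would be 30 scans
over user_day_range(user, day), one per day.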
