From: Jonathan Ellis
Date: Sat, 13 Mar 2010 07:35:56 -0600
Subject: Re: question about deleting from cassandra
To: cassandra-user@incubator.apache.org

You should submit your minor change to jira for others who might want to try it.

On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li wrote:
> Tried Sylvain's feature in 0.6 beta2 (need minor change) and it worked
> perfectly. Without this feature, as far as you have high volume new and
> expired columns your life will be miserable :-)
>
> Thanks for great job Sylvain!!
>
> -Weijun
>
> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne wrote:
>>
>> I guess you can also vote for this ticket :
>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>>
>> --
>> Sylvain
>>
>>
>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson wrote:
>> > On 12 March 2010 03:34, Bill Au wrote:
>> >>
>> >> Let take Twitter as an example. All the tweets are timestamped. I want
>> >> to keep only a month's worth of tweets for each user. The number of tweets
>> >> that fit within this one month window varies from user to user. What is the
>> >> best way to accomplish this?
>> >
>> > This is the "expiry" problem that has been discussed on this list before. As
>> > far as I can see there are no easy ways to do it with 0.5
>> >
>> > If you use the ordered partitioner and make the first part of the keys a
>> > timestamp (or part of it) then you can get the keys and delete them.
>> >
>> > However, these deletes will be quite inefficient, currently each row must be
>> > deleted individually (there was a patch to range delete kicking around, I
>> > don't know if it's accepted yet)
>> >
>> > But even if range delete is implemented, it's still quite inefficient and
>> > not really what you want, and doesn't work with the RandomPartitioner
>> >
>> > If you have some metadata to say who tweeted within a given period (say 10
>> > days or 30 days) and you store the tweets all in the same key per user per
>> > period (say with one column per tweet, or use supercolumns), then you can
>> > just delete one key per user per period.
>> >
>> > One of the problems with using a time-based key with ordered partitioner is
>> > that you're always going to have a data imbalance, so you may want to try
>> > hashing *part* of the key (The first part) so you can still range scan the
>> > next part. This may fix load balancing while still enabling you to use range
>> > scans to do data expiry.
>> >
>> > e.g. your key is
>> >
>> > Hash of day number + user id + timestamp
>> >
>> > Then you can range scan the entire day's tweets to expire them, and range
>> > scan a given user's tweets for a given day efficiently (and doing this for
>> > 30 days is just 30 range scans)
>> >
>> > Putting a hash in there fixes load balancing with OPP.
>> >
>> > Mark
>> >
>
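
For readers following the thread, the key layout Mark describes ("Hash of day number + user id + timestamp") can be sketched roughly as below. This is only an illustration under stated assumptions, not code from the thread or from Cassandra: the function names, the ":" separator, the truncated MD5 prefix, and the zero-padded timestamp are all hypothetical choices, and no Cassandra client calls are shown. It assumes keys are compared as plain strings, as they would be under an order-preserving partitioner, and that user ids never contain the separator.

```python
import hashlib

SEP = ":"  # hypothetical separator; user ids must not contain it


def day_number(ts: int) -> int:
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // 86400


def day_prefix(day: int) -> str:
    """Fixed-width hash of the day number (assumed: first 16 hex chars of MD5).

    Hashing scatters consecutive days across the key space, which is the
    load-balancing effect described in the thread. Two days could in
    principle share a truncated prefix; this sketch ignores that.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:16]


def tweet_key(user_id: str, ts: int) -> str:
    """Row key: hash(day number) + user id + zero-padded timestamp."""
    return SEP.join([day_prefix(day_number(ts)), user_id, f"{ts:010d}"])


def _upper_bound(prefix_with_sep: str) -> str:
    """Smallest string greater than every key starting with the prefix."""
    return prefix_with_sep[:-1] + chr(ord(SEP) + 1)


def day_range(day: int) -> tuple[str, str]:
    """[start, end) key range covering every tweet stored for one day."""
    start = day_prefix(day) + SEP
    return start, _upper_bound(start)


def user_day_range(user_id: str, day: int) -> tuple[str, str]:
    """[start, end) key range covering one user's tweets for one day."""
    start = day_prefix(day) + SEP + user_id + SEP
    return start, _upper_bound(start)
```

Expiring a day then amounts to scanning `day_range(day)` and deleting the rows it returns, and reading a user's last 30 days is 30 calls to `user_day_range`, one per day, which matches the "30 range scans" mentioned above.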