Subject: effective modeling for fixed limit columns
From: Chris Shorrock <chris.shorrock@gmail.com>
To: user@cassandra.apache.org
Date: Fri, 16 Apr 2010 11:50:28 -0700

I'm attempting to come up with a technique for limiting the number of columns a single key (or super column - it doesn't matter much for the context of this conversation) may contain at any one time. My actual use case is a little too meaty to describe here, so an alternate use case for this mechanism could be:

*Construct a twitter-esque feed which maintains a list of N tweets. Tweets (in this system - and in reality, I suppose) occur at such a rate that you want to limit a given user's "feed" to N items. You do not have the ability to store an infinite number of tweets due to the physical constraints of your hardware.*

The "*my first idea*" answer is that when a tweet is inserted into the feed of a given person, you then do a count and delete of any outstanding tweets. In reality you could first count, then (if count >= N) do a batch mutate for the insertion of the new entry and the removal of the old. My issue with this approach is that after a certain point, every new entry into the system will incur the removal of an old entry: once a feed has reached N, the count will always be >= N on any subsequent queries.
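For concreteness, here is a minimal sketch of that count-then-prune flow. A plain dict stands in for a row's columns (no live cluster required), and the integer column names, the limit N, and the insert_tweet helper are all illustrative assumptions; against a real column family the count would be a get_count and the insert plus delete would go through a single batch mutate.

```python
# Sketch of the count-then-prune approach. A dict of sortable column
# names -> values stands in for one row of the feed column family.

N = 5  # feed size limit (illustrative)

def insert_tweet(feed, col_name, tweet):
    """Insert a tweet; if the feed already holds N columns, drop the oldest.

    feed     -- dict mapping column name (e.g. a TimeUUID) -> tweet body
    col_name -- sortable column name; an increasing integer here
    """
    if len(feed) >= N:                          # the "count" step
        # Read the column names so we know which to delete -- against a
        # real column family this is a slice query, not just a count.
        oldest = sorted(feed)[: len(feed) - N + 1]
        for name in oldest:                     # the "delete" half of the batch
            del feed[name]
    feed[col_name] = tweet                      # the "insert" half of the batch

feed = {}
for i in range(8):
    insert_tweet(feed, i, "tweet %d" % i)

# Only the N most recent tweets remain.
print(sorted(feed))   # -> [3, 4, 5, 6, 7]
```

Note that the count and the batch mutate are two separate round trips here, which is exactly where the "every insert pays for a delete" overhead shows up once the feed is full.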
Depending on how you index the tweets, you may also need to do a read instead of a count to get the identifiers of the entries to delete.

My second approach was to utilize a "slot" system, where you have a record stored somewhere that indicates the next slot for insertion. This can be thought of as a fixed-length array where you store the next insertion point in some other column family. When a new tweet occurs you retrieve the current "slot" metadata, insert into that index, then update the metadata for the next insertion. My concerns with this relate to synchronization and losing entries due to concurrent operations; I'd rather not have to use something like ZooKeeper to synchronize across the application cluster.

I have some other ideas, but I'm mostly just spit-balling at this point, so I thought I'd reach out to the collective intelligence of the group to see if anyone has implemented something similar. Thanks in advance.
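For what it's worth, the slot scheme described above amounts to a fixed-size ring buffer. A minimal model (dicts standing in for the feed row and the metadata column family; names like next_slot and the slot-%d column naming are purely illustrative) also shows where the race lives: the read of next_slot and its write-back are separate operations, so two concurrent writers can claim the same slot.

```python
# Sketch of the "slot" approach: a ring buffer of N column slots, with
# the next insertion point kept as separate metadata. In Cassandra the
# metadata would live in its own column family; without an atomic
# read-modify-write, concurrent writers can race on next_slot.

N = 5  # fixed number of slots (illustrative)

def insert_tweet(feed, meta, tweet):
    slot = meta["next_slot"]              # read the current slot metadata
    feed["slot-%d" % slot] = tweet        # overwrite implicitly evicts the old entry
    meta["next_slot"] = (slot + 1) % N    # advance the pointer, wrapping at N

feed, meta = {}, {"next_slot": 0}
for i in range(8):
    insert_tweet(feed, meta, "tweet %d" % i)

print(len(feed))        # -> 5; the row never grows past N columns
print(feed["slot-0"])   # -> 'tweet 5'; slot 0 was reused for the 6th tweet
```

The appeal is that no count or delete is ever needed (overwriting a slot evicts the old entry for free); the cost is that the read-then-update of the pointer needs some form of coordination to be safe.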