Subject: effective modeling for fixed limit columns
From: Chris Shorrock <chris.shorrock@gmail.com>
To: user@cassandra.apache.org
Date: Fri, 16 Apr 2010 11:50:28 -0700

I'm attempting to come up with a technique for limiting the number of columns a single key (or super column - it doesn't matter much for the context of this conversation) may contain at any one time. My actual use case is a little too meaty to describe here, so an alternate use case for this mechanism could be:

*Construct a twitter-esque feed which maintains a list of N tweets. Tweets (in this system - and in reality, I suppose) occur at such a rate that you want to limit a given user's "feed" to N items. You do not have the ability to store an infinite number of tweets due to the physical constraints of your hardware.*

The "*my first idea*" answer is that when a tweet is inserted into the feed of a given person, you then do a count and delete of any outstanding tweets. In reality you could first count, then (if count >= N) do a batch mutate for the insertion of the new entry and the removal of the old. My issue with this approach is that after a certain point, every new entry into the system will incur the removal of an old entry: once a feed has reached N, the count will always be >= N on any subsequent queries.
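For concreteness, here is a minimal sketch of that count-then-prune flow. A plain dict stands in for a row's columns (no live cluster required), and the integer column names, the limit N, and the insert_tweet helper are all illustrative assumptions; against a real column family the count would be a get_count and the insert plus delete would go through a single batch mutate.

```python
# Sketch of the count-then-prune approach. A dict of sortable column
# names -> values stands in for one row of the feed column family.

N = 5  # feed size limit (illustrative)

def insert_tweet(feed, col_name, tweet):
    """Insert a tweet; if the feed already holds N columns, drop the oldest.

    feed     -- dict mapping column name (e.g. a TimeUUID) -> tweet body
    col_name -- sortable column name; an increasing integer here
    """
    if len(feed) >= N:                          # the "count" step
        # Read the column names so we know which to delete -- against a
        # real column family this is a slice query, not just a count.
        oldest = sorted(feed)[: len(feed) - N + 1]
        for name in oldest:                     # the "delete" half of the batch
            del feed[name]
    feed[col_name] = tweet                      # the "insert" half of the batch

feed = {}
for i in range(8):
    insert_tweet(feed, i, "tweet %d" % i)

# Only the N most recent tweets remain.
print(sorted(feed))   # -> [3, 4, 5, 6, 7]
```

Note that the count and the batch mutate are two separate round trips here, which is exactly where the "every insert pays for a delete" overhead shows up once the feed is full.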
Depending on how you index the tweets, you may also need to do a read instead of a count to get the identifiers of the entries to delete.

My second approach was to utilize a "slot" system, where you have a record stored somewhere that indicates the next slot for insertion. This can be thought of as a fixed-length array where you store the next insertion point in some other column family. When a new tweet occurs you retrieve the current "slot" metadata, insert into that index, then update the metadata for the next insertion. My concerns with this relate to synchronization and losing entries due to concurrent operations; I'd rather not have to use something like ZooKeeper to synchronize across the application cluster.

I have some other ideas, but I'm mostly just spit-balling at this point, so I thought I'd reach out to the collective intelligence of the group to see if anyone has implemented something similar. Thanks in advance.
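For what it's worth, the slot scheme described above amounts to a fixed-size ring buffer. A minimal model (dicts standing in for the feed row and the metadata column family; names like next_slot and the slot-%d column naming are purely illustrative) also shows where the race lives: the read of next_slot and its write-back are separate operations, so two concurrent writers can claim the same slot.

```python
# Sketch of the "slot" approach: a ring buffer of N column slots, with
# the next insertion point kept as separate metadata. In Cassandra the
# metadata would live in its own column family; without an atomic
# read-modify-write, concurrent writers can race on next_slot.

N = 5  # fixed number of slots (illustrative)

def insert_tweet(feed, meta, tweet):
    slot = meta["next_slot"]              # read the current slot metadata
    feed["slot-%d" % slot] = tweet        # overwrite implicitly evicts the old entry
    meta["next_slot"] = (slot + 1) % N    # advance the pointer, wrapping at N

feed, meta = {}, {"next_slot": 0}
for i in range(8):
    insert_tweet(feed, meta, "tweet %d" % i)

print(len(feed))        # -> 5; the row never grows past N columns
print(feed["slot-0"])   # -> 'tweet 5'; slot 0 was reused for the 6th tweet
```

The appeal is that no count or delete is ever needed (overwriting a slot evicts the old entry for free); the cost is that the read-then-update of the pointer needs some form of coordination to be safe.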