incubator-cassandra-user mailing list archives

From Daniel Kluesing ...@bluekai.com>
Subject RE: expiring data out of Cassandra/time to live
Date Wed, 31 Mar 2010 19:43:05 GMT
We also applied this patch to the 0.6 branch and have been running it for a bit over a week.
Works well, would love to see it get into trunk/0.7 proper.

From: Ryan Daum [mailto:ryan@thimbleware.com]
Sent: Wednesday, March 31, 2010 11:49 AM
To: user@cassandra.apache.org
Subject: Re: expiring data out of Cassandra/time to live

I was able to successfully merge this patch into the 0.6 branch a few weeks ago by doing the
following:


 *   Downloading the patch
 *   Checking out the trunk of Cassandra from github
 *   Rolling back (checking out) the git repo to the same date that the patch was submitted
to Jira
 *   Applying the patch
 *   Committing to Git
 *   Merging forward to the 0.6 branch
 *   Resolving one or two minor conflicts.
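
The steps above can be sketched as an ordered list of git commands. This is only an illustration: the branch names (`trunk`, `cassandra-0.6`, `ttl-backport`), the patch file name, and the date are hypothetical placeholders, and it assumes the patch is in a format that `git apply` accepts (a patch produced by another tool may need to be applied differently). The function only builds the commands; it does not run them.

```python
import shlex
from typing import List


def backport_commands(patch_file: str, patch_date: str,
                      trunk: str = "trunk",
                      target: str = "cassandra-0.6",
                      topic: str = "ttl-backport") -> List[str]:
    """Return the shell commands for the steps above, in order (nothing is run)."""
    q = shlex.quote
    return [
        # Check out trunk.
        f"git checkout {q(trunk)}",
        # Branch from the last trunk commit made before the patch was submitted.
        f"git checkout -b {q(topic)} "
        f"$(git rev-list -n 1 --before={q(patch_date)} {q(trunk)})",
        # Apply the downloaded patch and commit the result.
        f"git apply {q(patch_file)}",
        "git add -A",
        f"git commit -m {q('Apply patch from ' + patch_file)}",
        # Merge forward to the release branch; conflicts are resolved by hand.
        f"git checkout {q(target)}",
        f"git merge {q(topic)}",
    ]
```

Printing each entry of `backport_commands("ttl.patch", "2010-02-01")` gives a checklist to run by hand, with both arguments replaced by the real patch file and its Jira submission date.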

R

On Wed, Mar 31, 2010 at 2:46 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
Sounds like you want to follow
https://issues.apache.org/jira/browse/CASSANDRA-699.  There is a patch
there but I wouldn't recommend merging it if Java scares you. :)

On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
<mike.e.gallamore@googlemail.com> wrote:
> Hello everyone,
>
> I saw a thread on the incubator user chat that started a few months ago:
> http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html
> . It looks like this is the new official user mailing list so I'll add my
> thoughts/question here.
>
> Is there any way to set a TTL on data stored in Cassandra? Deleting old
> SSTables isn't enough for my needs. I need the data to go away after a fixed
> period of time. Here is what I'm trying to do and my reasoning why I think
> Cassandra and not something like Flare/Memcache meets my needs:
>
> I'm building a reputation system. We get lots of data at my work (in the
> tens of GB of reputation data a day). The trick is that old data is not
> useful, as a sender's IP address might have changed, they might have had a bot
> on their system and now have removed it, etc. So I need to be able to keep
> data for a fixed period of time, and afterwards it isn't needed and ideally
> would be GC'd out.
>
> We want to do one thing if we either never heard of the individual or at
> least not since the expiry time, and another thing based on the reputation
> data that is stored in Cassandra if it is current. So ideally a Cassandra
> call for a key for someone whose reputation is expired would return nothing
> and we'd reply with our default reputation for that individual. There really
> is no point using network bandwidth to return all the fields associated with
> that key only to look at a timestamp and end up ignoring it anyway.
> Similarly the latency of requesting first the timestamp and then the data in
> two separate requests is prohibitive.
>
> Why Cassandra:
>
>  *   Our data is complex and hard to handle completely in a key/value sense.
> In the past we were doing this and just encoding the complex structure
> inside of JSON, but this isn't ideal. It is very nice algorithmically to be
> able to say: give me this column, or update this element of this hash, etc.,
> rather than having to pull the old version, decode, modify, re-encode and
> push back to a cache-based system.
>  *   Our data is large (in the low TBs at the moment, but expected to grow to
> 50-100TB of live data).
>  *   We need quick response for both searches and writes: typically for each
> thing we track we get a request for the reputation, the message gets processed
> and then we get feedback from the recipient. So reads and writes are
> symmetric.
>  *   High request rate: millions per hour.
>  *   Hundreds of millions of unique reputations (this is why crawling through
> the data with a script purging old data doesn't make sense).
>  *   Availability/load balancing is a must. Data needs to be replicated, and a
> disk copy is useful so that if we have a power outage we don't lose the system.
>  *   It would be interesting to keep a local subset of our data at customers'
> sites and have them "replicate up" their data rather than send their
> feedback in a different manner that then has to be processed and pumped into
> our datastore (hopefully this is possible with Cassandra with some creative
> choices of how the data is hashed between nodes).
>
> Does the capability to set an expiry time exist? If not, are there any plans
> to add it? My Java experience is very limited (I'm accessing Cassandra via
> Thrift/Perl) so it isn't something I'd be able to jump in and run with
> myself.
>
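
The read behaviour requested in the original message (a lookup for an expired key returns nothing, and the caller falls back to a default reputation) can be illustrated with a small in-memory sketch. This is not the CASSANDRA-699 patch and not a Cassandra API; `ExpiringStore`, `reputation_for`, and `DEFAULT_REPUTATION` are hypothetical names used only to show the semantics.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical default returned for unknown or expired senders.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """In-memory stand-in for a store whose rows expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._rows: Dict[str, Tuple[float, dict]] = {}

    def put(self, key: str, columns: dict, ttl_seconds: float) -> None:
        # Record the absolute expiry time alongside the data.
        self._rows[key] = (self._clock() + ttl_seconds, dict(columns))

    def get(self, key: str) -> Optional[dict]:
        entry = self._rows.get(key)
        if entry is None:
            return None
        expires_at, columns = entry
        if self._clock() >= expires_at:
            # Expired rows are dropped on read and look like missing rows.
            del self._rows[key]
            return None
        return columns


def reputation_for(store: ExpiringStore, sender: str) -> dict:
    """One lookup: current data if present, otherwise the default."""
    columns = store.get(sender)
    return columns if columns is not None else DEFAULT_REPUTATION
```

The caller never inspects a timestamp itself: expiry is decided inside the store, so an expired reputation costs a single request and returns no column data.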

