incubator-cassandra-user mailing list archives

From Ryan Daum <r...@thimbleware.com>
Subject Re: expiring data out of Cassandra/time to live
Date Wed, 31 Mar 2010 18:49:23 GMT
I was able to successfully merge this patch into the 0.6 branch a few weeks
ago by doing the following:


   - Downloading the patch
   - Checking out the trunk of Cassandra from GitHub
   - Rolling back (checking out) the git repo to the same date that the
   patch was submitted to Jira
   - Applying the patch
   - Committing to Git
   - Merging forward to the 0.6 branch
   - Resolving one or two minor conflicts


R

On Wed, Mar 31, 2010 at 2:46 PM, Jonathan Ellis <jbellis@gmail.com> wrote:

> Sounds like you want to follow
> https://issues.apache.org/jira/browse/CASSANDRA-699.  There is a patch
> there but I wouldn't recommend merging it if Java scares you. :)
>
> On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
> <mike.e.gallamore@googlemail.com> wrote:
> > Hello everyone,
> >
> > I saw a thread on the incubator user chat that started a few months ago:
> > http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html
> > It looks like this is the new official user mailing list so I'll add my
> > thoughts/question here.
> >
> > Is there any way to set a TTL on data stored in Cassandra? Deleting old
> > SSTables isn't enough for my needs. I need the data to go away after a
> > fixed period of time. Here is what I'm trying to do and my reasoning for
> > why I think Cassandra, and not something like Flare/Memcache, meets my
> > needs:
> >
> > I'm building a reputation system. We get lots of data at my work (in the
> > tens of GB of reputation data a day). The trick is that old data is not
> > useful, as a sender's IP address might have changed, they might have had a
> > bot on their system and have now removed it, etc. So I need to be able to
> > keep data for a fixed period of time, after which it isn't needed and
> > ideally would be GC'd out.
> >
> > We want to do one thing if we have either never heard of the individual or
> > at least not since the expiry time, and another thing based on the
> > reputation data that is stored in Cassandra if it is current. So ideally a
> > Cassandra call for a key for someone whose reputation is expired would
> > return nothing and we'd reply with our default reputation for that
> > individual. There really is no point using network bandwidth to return all
> > the fields associated with that key only to look at a timestamp and end up
> > ignoring it anyway. Similarly, the latency of requesting first the
> > timestamp and then the data in two separate requests is prohibitive.
> >
> > Why Cassandra:
> >
> >    - Our data is complex and is hard to handle completely in a key/value
> >    sense. In the past we were doing this and just encoding the complex
> >    structure inside of JSON, but this isn't ideal. It is very nice
> >    algorithmically to be able to say: give me this column, or update this
> >    element of this hash, etc., rather than having to pull the old version,
> >    decode, modify, re-encode and push back to a cache-based system.
> >    - Our data is large (in the low TBs at the moment, but expected to grow
> >    to 50-100TB of live data).
> >    - Need quick response for both searches and writes: typically for each
> >    thing we track we get a request for the reputation, the message gets
> >    processed and then we get feedback back from the recipient. So reads
> >    and writes are symmetric.
> >    - High request rate: millions per hour.
> >    - Hundreds of millions of unique reputations (this is why crawling
> >    through the data with a script purging old data doesn't make sense).
> >    - Availability/load balancing is a must. Data needs to be replicated,
> >    and a disk copy is useful so that if we have a power outage we don't
> >    lose the system.
> >    - It would be interesting to keep a local subset of our data at
> >    customers' sites and have them "replicate up" their data rather than
> >    send their feedback in a different manner that then has to be processed
> >    and pumped into our datastore (hopefully this is possible with
> >    Cassandra with some creative choices of how the data is hashed between
> >    nodes).
> >
> > Does the capability to set an expiry time exist? If not, are there any
> > plans to add it? My Java experience is very limited (I'm accessing
> > Cassandra via Thrift/Perl) so it isn't something I'd be able to jump in
> > and run with myself.
> >
>
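
For reference, the read semantics asked for in the quoted message (an
expired key returns nothing, and the caller falls back to a default
reputation) can be sketched as a toy model. This is only an in-memory
illustration in Python, not Cassandra's API and not the CASSANDRA-699
patch; ExpiringStore, lookup_reputation and DEFAULT_REPUTATION are
hypothetical names made up for the example.

```python
import time

# Hypothetical default returned when nothing current is known about a sender.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """Toy in-memory stand-in for a store with per-key expiry (not Cassandra)."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._rows = {}       # key -> (expires_at, columns)

    def put(self, key, columns, ttl_seconds):
        """Store the columns for key and record when they stop being valid."""
        self._rows[key] = (self._clock() + ttl_seconds, dict(columns))

    def get(self, key):
        """Return the columns for key, or None if unknown or expired."""
        entry = self._rows.get(key)
        if entry is None:
            return None
        expires_at, columns = entry
        if self._clock() >= expires_at:
            # The store drops the expired row itself, so the caller never
            # receives stale fields just to inspect a timestamp.
            del self._rows[key]
            return None
        return columns


def lookup_reputation(store, sender):
    """One read: current data if present, otherwise the default reputation."""
    columns = store.get(sender)
    return columns if columns is not None else DEFAULT_REPUTATION
```

The point of the sketch is that expiry is decided inside the store in a
single read, so neither the stale columns nor a separate timestamp request
ever cross the network.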
