incubator-cassandra-user mailing list archives

From Mike Gallamore <mike.e.gallam...@googlemail.com>
Subject expiring data out of Cassandra/time to live
Date Wed, 31 Mar 2010 18:39:28 GMT
Hello everyone,

I saw a thread on the incubator user chat that started a few months ago: 
http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html 
. It looks like this is the new official user mailing list so I'll add 
my thoughts/question here.

Is there any way to set a TTL on data stored in Cassandra? Deleting old 
SSTables isn't enough for my needs. I need the data to go away after a 
fixed period of time. Here is what I'm trying to do and my reasoning why 
I think Cassandra, and not something like Flare/Memcache, meets my needs:

I'm building a reputation system. We get lots of data at my work (in the 
tens of GB of reputation data a day). The trick is that old data is not 
useful, as a sender's IP address might have changed, they might have had a 
bot on their system and have now removed it, etc. So I need to be able to 
keep data for a fixed period of time, after which it isn't needed and 
ideally would be GC'd out.

We want to do one thing if we either never heard of the individual or at 
least not since the expiry time, and another thing based on the 
reputation data that is stored in Cassandra if it is current. So ideally 
a Cassandra call for a key for someone whose reputation is expired would 
return nothing and we'd reply with our default reputation for that 
individual. There really is no point using network bandwidth to return 
all the fields associated with that key only to look at a timestamp and 
end up ignoring it anyways. Similarly the latency of requesting first 
the timestamp and then the data in two separate requests is prohibitive.
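
To make the behaviour I'm after concrete, here is a minimal Python sketch. 
It is only a toy in-memory stand-in, not a Cassandra or Thrift API: the 
names ExpiringStore, reputation_for and DEFAULT_REPUTATION are made up for 
illustration, and the expiry check is assumed to happen on the storage side 
so that an expired row costs a single round trip and returns no columns.

```python
import time


class ExpiringStore:
    """Toy in-memory model of the desired semantics (hypothetical, not Cassandra)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock   # injectable clock so expiry can be exercised in tests
        self.rows = {}       # key -> (written_at, {column: value})

    def put(self, key, columns):
        # Each write replaces the row and resets its timestamp.
        self.rows[key] = (self.clock(), dict(columns))

    def get(self, key):
        entry = self.rows.get(key)
        if entry is None:
            return None
        written_at, columns = entry
        if self.clock() - written_at >= self.ttl_seconds:
            # Expired: drop the row and answer as if the key was never seen.
            del self.rows[key]
            return None
        return columns


# Hypothetical default used when nothing current is known about a sender.
DEFAULT_REPUTATION = {"score": 0}


def reputation_for(store, sender):
    # One request: either current columns come back or we fall back to the default.
    columns = store.get(sender)
    return columns if columns is not None else DEFAULT_REPUTATION
```

The point is that the caller never sees a timestamp: a lookup for an 
expired key looks exactly like a lookup for a key that was never written.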

Why Cassandra:

    * Our data is complex and is hard to handle completely in a
      key/value sense. In the past we were doing this and just encoding
      the complex structure inside of JSON but this isn't ideal. It is
      very nice algorithmically to be able to say: give me this column,
      or update this element of this hash etc, rather than having to
      pull the old version, decode, modify, re-encode and push back to a
      cache based system.
    * Our data is large (in the low TBs at the moment, but expected to
      grow to 50-100 TB of live data)
    * Need quick response for both searches and writes: typically for
      each thing we track we get a request for the reputation, the
      message gets processed and then we get feedback back from the
      recipient. So reads and writes are symmetric.
    * High request rate: millions per hour
    * Hundreds of millions of unique reputations (this is why crawling
      through the data with a script purging old data doesn't make sense)
    * Availability/load balancing is a must. Data needs to be replicated,
      and a disk copy is useful so that if we have a power outage we don't
      lose the system.
    * It would be interesting to keep a local subset of our data at
      customers' sites and have them "replicate up" their data rather
      than send their feedback in a different manner that then has to be
      processed and pumped into our datastore (hopefully this is
      possible with Cassandra with some creative choices of how the data
      is hashed between nodes)

Does the capability to set an expiry time exist? If not, are there any 
plans to add it? My Java experience is very limited (I'm accessing 
Cassandra via Thrift/Perl) so it isn't something I'd be able to jump in 
and run with myself.
