incubator-cassandra-user mailing list archives

From Jonathan Ellis <jbel...@gmail.com>
Subject Re: expiring data out of Cassandra/time to live
Date Wed, 31 Mar 2010 18:46:30 GMT
Sounds like you want to follow
https://issues.apache.org/jira/browse/CASSANDRA-699.  There is a patch
there but I wouldn't recommend merging it if Java scares you. :)

On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
<mike.e.gallamore@googlemail.com> wrote:
> Hello everyone,
>
> I saw a thread on the incubator user chat that started a few months ago:
> http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html
> . It looks like this is the new official user mailing list so I'll add my
> thoughts/question here.
>
> Is there any way to set a TTL on data stored in Cassandra? Deleting old
> SSTables isn't enough for my needs. I need the data to go away after a fixed
> period of time. Here is what I'm trying to do and my reasoning for why I
> think Cassandra, and not something like Flare/Memcache, meets my needs:
>
> I'm building a reputation system. We get lots of data at my work (in the
> tens of GB of reputation data a day). The trick is that old data is not
> useful, as a sender's IP address might have changed, they might have had a
> bot on their system and have now removed it, etc. So I need to be able to
> keep data for a fixed period of time, and afterwards it isn't needed and
> ideally would be GC'd out.
>
> We want to do one thing if we either never heard of the individual or at
> least not since the expiry time, and another thing based on the reputation
> data that is stored in Cassandra if it is current. So ideally a Cassandra
> call for a key for someone whose reputation is expired would return nothing
> and we'd reply with our default reputation for that individual. There really
> is no point using network bandwidth to return all the fields associated with
> that key only to look at a timestamp and end up ignoring it anyway.
> Similarly the latency of requesting first the timestamp and then the data in
> two separate requests is prohibitive.
>
> Why Cassandra:
>
> - Our data is complex and hard to handle in a pure key/value sense. In the
>   past we did this by encoding the complex structure as JSON, but that
>   isn't ideal. It is very nice algorithmically to be able to say "give me
>   this column" or "update this element of this hash", rather than having to
>   pull the old version, decode, modify, re-encode and push back to a
>   cache-based system.
> - Our data is large (in the low TBs at the moment, but expected to grow to
>   50-100TB of live data).
> - We need quick responses for both searches and writes: typically, for each
>   thing we track, we get a request for the reputation, the message gets
>   processed, and then we get feedback from the recipient. So reads and
>   writes are symmetric.
> - High request rate: millions per hour.
> - Hundreds of millions of unique reputations (this is why crawling through
>   the data with a script purging old data doesn't make sense).
> - Availability/load balancing is a must. Data needs to be replicated, and a
>   disk copy is useful so that if we have a power outage we don't lose the
>   system.
> - It would be interesting to keep a local subset of our data at customers'
>   sites and have them "replicate up" their data, rather than send their
>   feedback in a different manner that then has to be processed and pumped
>   into our datastore (hopefully this is possible with Cassandra with some
>   creative choices of how the data is hashed between nodes).
>
> Does the capability to set an expiry time exist? If not, are there any plans
> to add it? My Java experience is very limited (I'm accessing Cassandra via
> Thrift/Perl) so it isn't something I'd be able to jump in and run with
> myself.
>
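
For illustration only, the read behaviour asked for in the quoted message
(an expired key returns nothing and the caller falls back to a default
reputation) might look roughly like the sketch below. It is a plain
in-memory Python model, not Cassandra's API and not the CASSANDRA-699
patch; ExpiringStore, lookup_reputation and DEFAULT_REPUTATION are
hypothetical names invented for this example.

```python
import time

# Hypothetical default returned for unknown or expired senders.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """In-memory stand-in for a store with a fixed time to live per key."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._rows = {}             # key -> (written_at, columns)

    def put(self, key, columns):
        # Each write records its own timestamp, which restarts the TTL.
        self._rows[key] = (self.clock(), dict(columns))

    def get(self, key):
        entry = self._rows.get(key)
        if entry is None:
            return None
        written_at, columns = entry
        if self.clock() - written_at >= self.ttl_seconds:
            # Expired: drop it and behave as if the key was never written.
            del self._rows[key]
            return None
        return dict(columns)


def lookup_reputation(store, sender):
    """Return stored columns if current, otherwise the default reputation."""
    columns = store.get(sender)
    return columns if columns is not None else dict(DEFAULT_REPUTATION)
```

The point of the sketch is that the expiry check happens inside the store,
so the caller makes a single request and never receives stale columns.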
