incubator-cassandra-user mailing list archives

From Mike Gallamore <mike.e.gallam...@googlemail.com>
Subject Re: expiring data out of Cassandra/time to live
Date Wed, 31 Mar 2010 20:05:29 GMT
Thanks a lot Jonathan and everyone else that replied to my thread. This 
looks like it will do what I need. I have a colleague that is a Java 
wizard and will probably have no problem putting this patch into place 
for our production builds.

I'm a C/C++ programmer at heart so the code itself doesn't scare me;
it's just my unfamiliarity with Java's nuances that makes me reluctant
to try adding this myself.
On 03/31/2010 11:46 AM, Jonathan Ellis wrote:
> Sounds like you want to follow
> https://issues.apache.org/jira/browse/CASSANDRA-699.  There is a patch
> there but I wouldn't recommend merging it if Java scares you. :)
>
> On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
> <mike.e.gallamore@googlemail.com>  wrote:
>    
>> Hello everyone,
>>
>> I saw a thread on the incubator user chat that started a few months ago:
>> http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html
>> . It looks like this is the new official user mailing list so I'll add my
>> thoughts/question here.
>>
>> Is there any way to set a TTL on data stored in Cassandra? Deleting old
>> SSTables isn't enough for my needs. I need the data to go away after a fixed
>> period of time. Here is what I'm trying to do, and why I think Cassandra,
>> rather than something like Flare/Memcache, meets my needs:
>>
>> I'm building a reputation system. We get lots of data at my work (tens of
>> GB of reputation data a day). The trick is that old data is not useful: a
>> sender's IP address might have changed, they might have had a bot on their
>> system and have now removed it, etc. So I need to be able to keep data for a
>> fixed period of time, after which it isn't needed and ideally would be GC'd
>> out.
>>
>> We want to do one thing if we either never heard of the individual or at
>> least not since the expiry time, and another thing based on the reputation
>> data that is stored in Cassandra if it is current. So ideally a Cassandra
>> call for a key for someone whose reputation is expired would return nothing
>> and we'd reply with our default reputation for that individual. There really
>> is no point using network bandwidth to return all the fields associated with
>> that key only to look at a timestamp and end up ignoring it anyways.
>> Similarly the latency of requesting first the timestamp and then the data in
>> two separate requests is prohibitive.
>>
>> Why Cassandra:
>>
>> - Our data is complex and is hard to handle completely in a key/value sense.
>>   In the past we were doing this and just encoding the complex structure
>>   inside of JSON but this isn't ideal. It is very nice algorithmically to be
>>   able to say: give me this column, or update this element of this hash etc,
>>   rather than having to pull the old version, decode, modify, re-encode and
>>   push back to a cache based system.
>> - Our data is large (in the low TBs at the moment, but expected to grow to
>>   50-100TB of live data).
>> - We need quick responses for both searches and writes: typically for each
>>   thing we track we get a request for the reputation, the message gets
>>   processed and then we get feedback back from the recipient. So reads and
>>   writes are symmetric.
>> - High request rate: millions per hour.
>> - Hundreds of millions of unique reputations (this is why crawling through
>>   the data with a script purging old data doesn't make sense).
>> - Availability/load balancing is a must. Data needs to be replicated; a disk
>>   copy is useful so that if we have a power outage we don't lose the system.
>> - It would be interesting to keep a local subset of our data at customers'
>>   sites and have them "replicate up" their data rather than send their
>>   feedback in a different manner that then has to be processed and pumped
>>   into our datastore (hopefully this is possible with Cassandra with some
>>   creative choices of how the data is hashed between nodes).
>>
>> Does the capability to set an expiry time exist? If not, are there any plans
>> to add it? My Java experience is very limited (I'm accessing Cassandra via
>> thrift/Perl) so it isn't something I'd be able to jump in and run with
>> myself.
>>
>>      
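
For illustration only, the read behaviour asked for in the original
question (an expired key returns nothing, and the caller falls back to a
default reputation without a second round trip) can be sketched in Python
against a hypothetical in-memory store. This is not Cassandra's API, the
Thrift interface, or the CASSANDRA-699 patch; the names ExpiringStore,
lookup_reputation and DEFAULT_REPUTATION, and the shape of the default
value, are assumptions made up for the sketch.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical default returned for unknown or expired senders.
DEFAULT_REPUTATION = {"score": 0}


class ExpiringStore:
    """Toy in-memory store: each row carries its own expiry time."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        self._rows: Dict[str, Tuple[float, dict]] = {}

    def put(self, key: str, columns: dict, ttl_seconds: float) -> None:
        # Record the absolute expiry time alongside a copy of the data.
        self._rows[key] = (self._clock() + ttl_seconds, dict(columns))

    def get(self, key: str) -> Optional[dict]:
        entry = self._rows.get(key)
        if entry is None:
            return None
        expires_at, columns = entry
        if self._clock() >= expires_at:
            # Expired rows look exactly like missing rows to the caller,
            # and are dropped on read rather than by a separate crawl.
            del self._rows[key]
            return None
        return dict(columns)


def lookup_reputation(store: ExpiringStore, sender: str) -> dict:
    """One lookup: current data if present, the default otherwise."""
    row = store.get(sender)
    return row if row is not None else dict(DEFAULT_REPUTATION)
```

The point of the sketch is that the expiry check lives on the storage
side, so the client never fetches all the columns just to inspect a
timestamp and discard them.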

