incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Partition maintenance
Date Wed, 19 Dec 2012 21:04:12 GMT
Couple of approaches to exporting…

1) If you know the list of keys you want to export, you could use / modify the sstable2json
tool and pass in the list of keys. If expiring columns are used, strip the expiration metadata
afterwards, or modify sstable2json to not include it.
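For illustration only, a minimal post-processing sketch in Java. It assumes the pre-1.2
sstable2json layout (one JSON object mapping hex row keys to arrays of columns, with expiring
columns carrying extra fields after [name, value, timestamp]) and uses Jackson; the FilterDump
class and its arguments are hypothetical, not part of the tool:

    import com.fasterxml.jackson.databind.*;
    import com.fasterxml.jackson.databind.node.*;
    import java.io.File;
    import java.util.Set;

    // Hypothetical filter: args[0] = sstable2json output file,
    // args[1] = comma-separated hex row keys to keep.
    public class FilterDump {
        public static void main(String[] args) throws Exception {
            Set<String> wanted = Set.of(args[1].split(","));
            ObjectMapper mapper = new ObjectMapper();
            ObjectNode rows = (ObjectNode) mapper.readTree(new File(args[0]));
            ObjectNode out = mapper.createObjectNode();
            rows.fields().forEachRemaining(row -> {
                if (!wanted.contains(row.getKey())) return; // drop rows we don't want
                ArrayNode kept = out.putArray(row.getKey());
                for (JsonNode col : row.getValue()) {
                    // Expiring columns have extra fields after [name, value, timestamp];
                    // keep only the first three so the expiration is dropped.
                    ArrayNode trimmed = kept.addArray();
                    for (int i = 0; i < Math.min(3, col.size()); i++)
                        trimmed.add(col.get(i));
                }
            });
            mapper.writerWithDefaultPrettyPrinter()
                  .writeValue(new File(args[0] + ".filtered"), out);
        }
    }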

2) If the list of keys is too big but you can parse a key to determine whether it should be
exported, it would be possible to modify sstable2json to use a regex or similar to select the
keys which match, or to include some business logic.
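Again purely illustrative, a hypothetical predicate of that shape (it assumes row keys decode
to readable strings, which depends on your key format):

    import java.util.regex.Pattern;

    // Hypothetical key filter: export only rows whose key matches a pattern.
    class KeyFilter {
        // Assumes keys decode to readable strings, e.g. "2011-03:12345".
        private static final Pattern KEEP = Pattern.compile("^2011-.*");

        static boolean shouldExport(String rowKey) {
            return KEEP.matcher(rowKey).matches(); // or swap in business logic here
        }

        public static void main(String[] args) {
            for (String k : args)
                System.out.println(k + " -> " + (shouldExport(k) ? "export" : "skip"));
        }
    }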

Both of these would require reading all sstables. Even though the maximum column timestamp in
an sstable is stored on disk, it is of no use for selecting keys. In the server we work out the
min and max keys in each sstable, so it's possible to be a little smarter about this by reading
the -Index.db file and skipping sstables whose key range cannot match.
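A sketch of that idea, assuming the pre-1.2 -Index.db layout of repeated
[2-byte key length][key bytes][8-byte data file position] entries; check the sstable index
code for your version before relying on this:

    import java.io.*;

    // Hypothetical scan of a -Index.db file for the first and last keys.
    class IndexBounds {
        public static void main(String[] args) throws IOException {
            byte[] first = null, last = null;
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(args[0])))) {
                while (in.available() > 0) {
                    byte[] key = new byte[in.readUnsignedShort()];
                    in.readFully(key);
                    in.skipBytes(8); // position of the row in the -Data.db file
                    if (first == null) first = key; // entries are in partitioner order,
                    last = key;                     // so first/last give the bounds
                }
            }
            System.out.println("min key: " + hex(first));
            System.out.println("max key: " + hex(last));
        }

        static String hex(byte[] b) {
            StringBuilder sb = new StringBuilder();
            for (byte x : b) sb.append(String.format("%02x", x));
            return sb.toString();
        }
    }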

If you are already monkeying around with sstable export, it could also be changed to emit your
preferred output format.

3) Use Hadoop + Hive. 


To purge the data from Cassandra…

Use TTL, a low gc_grace_seconds, and compaction. You can specify the files you want to compact
via JMX, so this could be added to nodetool. It may be necessary to use some smarts to work out
which sstables to compact; the maxTimestamp in the sstable header will help here.
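For example, a bare-bones JMX client along these lines; the CompactionManager MBean operation
name and signature vary between Cassandra versions, so treat this as a sketch and check
CompactionManagerMBean for your build:

    import javax.management.*;
    import javax.management.remote.*;

    // Hypothetical trigger for a user-defined compaction over specific sstables.
    class UserDefinedCompaction {
        public static void main(String[] args) throws Exception {
            // args[0] = host, args[1] = keyspace, args[2] = comma-separated sstable files
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + ":7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName cm = new ObjectName(
                    "org.apache.cassandra.db:type=CompactionManager");
                mbs.invoke(cm, "forceUserDefinedCompaction",
                           new Object[] { args[1], args[2] },
                           new String[] { "java.lang.String", "java.lang.String" });
            }
        }
    }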

Note that columns are not purged unless all fragments of the row are included in the compaction.
This could be a problem, though it probably depends on your workload.

Hope that helps. 
 

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/12/2012, at 6:37 AM, Michael Kjellman <mkjellman@barracuda.com> wrote:

> Yeah. No JOINs as of now in Cassandra.
> 
> What if you dumped the CF in question once a month to JSON, and wrote out each record in the
> JSON data that matched the timestamp you were interested in archiving?
> 
> You could then bulk load each "month" back in if you had to restore. 
> 
> Doesn't help with deletes though, and I would advise against large mass delete operations
> each month -- tends to lead to a very unhappy cluster.
> 
> On Dec 18, 2012, at 9:23 AM, "Stephen.M.Thompson@wellsfargo.com" <Stephen.M.Thompson@wellsfargo.com> wrote:
> 
>> Michael - That is one approach I have considered, but that also makes querying the system
>> particularly onerous since every column family would require its own query – I don’t think
>> there is any good way to “join” those, right?
>>  
>> Chris – that is an interesting concept, but as Viktor and Keith note, it seems to have
>> problems.
>>  
>> Could we do this simply by mass deletes?  For example, if I created a column which was just
>> YYYY/MM, then during our maintenance we could spool off the records that match the month we
>> are archiving, then do a bulk delete by that key.  We would need to have a secondary index
>> for that, I would assume.
>>  
>>  
>> From: Michael Kjellman [mailto:mkjellman@barracuda.com] 
>> Sent: Tuesday, December 18, 2012 11:15 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Partition maintenance
>>  
>> You could make a column family for each period of time and then drop the column family when
>> you want to destroy it. Before you drop it you could use the sstable2json converter and
>> write the JSON files out to tape.
>>  
>> Might make your life difficult, however, if you need an input split for map reduce between
>> each time period, because you would be limited to working on one column family at a time.
>> 
>> On Dec 18, 2012, at 8:09 AM, "Stephen.M.Thompson@wellsfargo.com" <Stephen.M.Thompson@wellsfargo.com> wrote:
>> 
>> Hi folks.  Still working through the details of building out a Cassandra solution, and I
>> have an interesting requirement that I’m not sure how to implement in Cassandra:
>>  
>> In our current Oracle world, we have the data for this system partitioned by month, and each
>> month the data that are now 18 months old are archived to tape/cold storage and then the
>> partition for that month is dropped.  Is there a way to do something similar with Cassandra
>> without destroying our overall performance?
>>  
>> Thanks in advance,
>> Steve
>>  
> 

