cassandra-user mailing list archives

From Marcel Steinbach <>
Subject Get all keys from the cluster
Date Sat, 21 Jan 2012 09:45:49 GMT
We're running an 8-node cluster with different CFs for different applications. One of the applications
uses 1.5TB out of 1.8TB in total, but only because we started out without a deletion mechanism
and implemented one later on. So there is probably a large amount of old data in there that
we don't even use anymore.

Now we want to delete that data. To know which rows we may delete, we have to look up a SQL
database. If the key is not in there anymore, we may delete that row in Cassandra, too.

This basically means we have to iterate over all the rows in that CF. This kind of begs for
Hadoop, but that doesn't seem to be an option currently. I tried.

So we figured we could run over the SSTable files (maybe only the index), check the keys
against the MySQL database, and later run the deletes on the cluster. This way, we could iterate
on each node in parallel.
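To make the idea concrete, here is a minimal sketch of the filtering step. It assumes the keys have already been dumped per node with the `sstablekeys` tool (one key per line) and that the set of still-live keys has been fetched from the SQL database separately; the function names and the file layout are illustrative, not part of any Cassandra API.

```python
def load_dumped_keys(lines):
    """Parse sstablekeys-style output: one key per line, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}

def find_deletable(dumped_keys, live_keys):
    """Keys present in the SSTables but missing from the SQL database
    are candidates for deletion in Cassandra."""
    return dumped_keys - live_keys

# Example: keys dumped from one node's SSTables vs. keys still in MySQL.
dumped = load_dumped_keys(["key1\n", "key2\n", "", "key3\n"])
to_delete = find_deletable(dumped, live_keys={"key2"})
```

The actual deletes would then be issued against the cluster in batches, one list per node, so each node's candidate set can be produced and processed in parallel.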

Does that sound reasonable? Any pros/cons, maybe a "killer" argument to use hadoop for that?
