cassandra-user mailing list archives

Subject Finding records that exist on Cassandra but not externally
Date Wed, 07 Sep 2016 07:47:00 GMT
First off, I hope this is appropriate here. I couldn't decide whether this was a question for Cassandra
users or Spark users, so if you think it's in the wrong place feel free to redirect me.

I have a system that does a load of data manipulation using Spark. The output of this program
is effectively the new state that I want my Cassandra table to be in, and the final step
is to update Cassandra so that it matches this state.

At present I'm inserting all rows in my generated state into Cassandra. This works
for new rows and also for updating existing rows, but of course it doesn't delete any rows that
were already in Cassandra but are not in my new state.
The problem I have now is how best to delete these missing rows. Options I have considered:

1. Setting a TTL on inserts which is roughly the same as my data refresh period. This would
probably be pretty performant, but I really don't want to do this because it would mean that
all data in my database would disappear if I had issues running my refresh task!

2. Every time I refresh the data, first fetching all primary keys from Cassandra
and comparing them to the primary keys locally to create a list of PKs to delete before the insert.
This seems the most logically correct option but is going to result in reading vast amounts
of data from Cassandra.

3. Truncating the entire table before refreshing Cassandra. This has the benefit of being
pretty simple in code but I'm not sure of the performance implications of this and what will
happen if I truncate while a node is offline.
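
To make option 2 concrete, here is a minimal sketch of just the comparison step, written in plain Python and not as a Spark job. The function name `keys_to_delete` and the sample keys are made up for illustration; in the real system the existing keys would come from a read against Cassandra and the new keys from the Spark output.

```python
from typing import Hashable, Iterable, Set


def keys_to_delete(
    existing_keys: Iterable[Hashable],
    new_keys: Iterable[Hashable],
) -> Set[Hashable]:
    """Return the primary keys present in Cassandra but absent from the new state."""
    # Building a set of the new keys makes each membership check O(1) on average.
    new_key_set = set(new_keys)
    return {key for key in existing_keys if key not in new_key_set}


# Hypothetical sample data: each key is a (partition_key, clustering_key) tuple.
existing = [("user1", 1), ("user2", 1), ("user3", 1)]
new_state = [("user1", 1), ("user3", 1), ("user4", 1)]

stale = keys_to_delete(existing, new_state)
print(sorted(stale))  # [('user2', 1)]
```

Only the keys in `stale` would need a delete; everything else is covered by the normal insert/overwrite pass. With tens of millions of rows the same set difference would presumably be done in a distributed way inside Spark, not on a single machine.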

For reference, the table is on the order of 10s of millions of rows, and for any data refresh
only a very small fraction (<0.1%) will actually need deleting. 99% of the time I'll just
be overwriting existing keys.

I'd be grateful if anyone could offer some advice on the best solution here or whether there's
some better way I haven't thought of.


