cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jens Rantil <jens.ran...@tink.se>
Subject Re: Finding records that exist on Cassandra but not externally
Date Thu, 08 Sep 2016 11:32:34 GMT
Hi again Chris,

Another option would be to have a look at using a Merkle Tree to quickly
drill down to the differences. This is actually what Cassandra uses
internally when running a repair between different nodes.

Cheers,
Jens

On Wed, Sep 7, 2016 at 9:47 AM <chris@cmartinit.co.uk> wrote:

> First off I hope this appropriate here- I couldn't decide whether this was
> a question for Cassandra users or spark users so if you think it's in the
> wiring place feel free to redirect me.
>
> I have a system that does a load of data manipulation using spark.  The
> output of this program is a effectively the new state that I want my
> Cassandra table to be in and the final step is to update Cassandra so that
> it matches this state.
>
> At present I'm currently inserting all rows in my generated state into
> Cassandra. This works for new rows and also for updating existing rows but
> doesn't of course delete any rows that were already in Cassandra but not in
> my new state.
>
> The problem I have now is how best to delete these missing rows. Options I
> have considered are:
>
> 1. Setting a ttl on inserts which is roughly the same as my data refresh
> period. This would probably be pretty performant but I really don't want to
> do this because it would mean that all data in my database would disappear
> if I had issues running my refresh task!
>
> 2. Every time I refresh the data I would first have to fetch all primary
> keys from Cassandra and, compare them to primary keys locally to create a
> list of pks to delete before the insert. This seems the most logicaly
> correct option but is going to result in reading vast amounts of data from
> Cassandra.
>
> 3. Truncating the entire table before refreshing Cassandra. This has the
> benefit of being pretty simple in code but I'm not sure of the performance
> implications of this and what will happen if I truncate while a node is
> offline.
>
> For reference the table is on the order of 10s of millions of rows and for
> any data refresh only a very small fraction (<.1%) will actually need
> deleting. 99% of the time I'll just be overwriting existing keys.
>
> I'd be grateful if anyone could shed some advice on the best solution here
> or whether there's some better way I haven't thought of.
>
> Thanks,
>
> Chris
>
-- 

Jens Rantil
Backend Developer @ Tink

Tink AB, Wallingatan 5, 111 60 Stockholm, Sweden
For urgent matters you can reach me at +46-708-84 18 32.

Mime
View raw message