incubator-cassandra-user mailing list archives

From Maxim Potekhin <potek...@bnl.gov>
Subject Re: Mass deletion -- slowing down
Date Mon, 14 Nov 2011 00:55:03 GMT
Thanks to all for valuable insight!

Two comments:
a) this is not actually time series data, but yes, each item has
a timestamp and thus chronological attribution.

b) So, what do you recommend in practice? I need to delete
half a million to a million entries daily, then insert fresh data.
What's the right operational procedure?
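(For concreteness, the daily cycle looks roughly like this in Pycassa --
a minimal sketch; the keyspace, column family, and data below are made
up for illustration:)

    import pycassa

    # Hypothetical keyspace/column family names, for illustration only.
    pool = pycassa.ConnectionPool('mykeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'jobs')

    keys_to_delete = ['job00001', 'job00002']   # in reality ~0.5-1M keys
    fresh_rows = {'job90001': {'date': '20111114', 'status': 'new'}}

    # Batch the mutations so we don't issue one RPC per row.
    with cf.batch(queue_size=200) as b:
        for key in keys_to_delete:
            b.remove(key)            # each remove writes a tombstone
        for key, cols in fresh_rows.items():
            b.insert(key, cols)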

For some reason I can still select on the index from the CLI; it's
the Pycassa module that gives me trouble. I do need it, though, as
this is my platform and we are a Python shop.
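(The query that fails is essentially an indexed get. A minimal Pycassa
sketch of it, with made-up keyspace, column family, and column names:)

    import pycassa
    from pycassa.index import create_index_expression, create_index_clause

    # Names below are made up for illustration.
    pool = pycassa.ConnectionPool('mykeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'jobs')

    # Equality query on a secondary-indexed column,
    # e.g. all rows for a given day.
    expr = create_index_expression('date', '20111113')
    clause = create_index_clause([expr], count=1000)

    for key, columns in cf.get_indexed_slices(clause):
        print key, columns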

Maxim



On 11/13/2011 7:22 PM, Peter Schuller wrote:
> Deletions in Cassandra imply the use of tombstones (see
> http://wiki.apache.org/cassandra/DistributedDeletes), and under some
> circumstances reads can turn O(n) with respect to the number of
> columns deleted. It sounds like this is what you're seeing.
>
> For example, suppose you're inserting a range of columns into a row,
> deleting it, and inserting another non-overlapping subsequent range.
> Repeat that a bunch of times. In terms of what's stored in Cassandra
> for the row you now have:
>
>    tomb
>    tomb
>    tomb
>    tomb
>    ....
>     actual data
>
> If you then do something like a slice on that row with the end-points
> being such that they include all the tombstones, Cassandra essentially
> has to read through and process all those tombstones (for the
> PostgreSQL-aware: this is similar to the effect you can get when
> implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect
> to the number of deleted entries until the last vacuum - improved in
> modern versions).
>
>
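To make the slice example above concrete: a read like the following
(a sketch, with made-up row and column family names) has to walk every
tombstone sitting between the two end-points before it reaches live data:

    import pycassa

    # Hypothetical names, for illustration only.
    pool = pycassa.ConnectionPool('mykeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'timeline')

    # Slicing from the start of the row: Cassandra must scan past every
    # tombstone left by the earlier deletions before it can return the
    # first 100 live columns.
    columns = cf.get('some_row', column_start='', column_finish='',
                     column_count=100)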

