cassandra-user mailing list archives

From Maxim Potekhin <>
Subject Re: Mass deletion -- slowing down
Date Mon, 14 Nov 2011 21:09:21 GMT
Thanks for the note. Ideally I would rather not keep track of the oldest 
indexed date,
because that means building a bit of infrastructure on top of my database,
with the attendant referential integrity problems.

But I suppose I'll be forced to do that. In addition, I'll have to wait 
until the grace period is over and then compact,
removing the tombstones and finally clearing the disk (which is what I 
needed to do in the first place).
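As a toy illustration of the grace-period mechanics described above (a hypothetical Python model, not Cassandra internals; `gc_grace_seconds` defaults to 10 days, and the `Cell`/`compact` names here are made up for the sketch):

```python
from dataclasses import dataclass
from typing import Optional

GC_GRACE_SECONDS = 864000  # Cassandra's default gc_grace_seconds: 10 days

@dataclass
class Cell:
    value: Optional[str]
    deleted_at: Optional[float] = None  # set when the cell is a tombstone

def compact(cells, now):
    """Toy compaction: live cells are kept; tombstones are only dropped
    once they are older than the grace period, and only then is the
    disk space actually reclaimed."""
    kept = []
    for c in cells:
        if c.deleted_at is None:
            kept.append(c)                        # live data survives
        elif now - c.deleted_at < GC_GRACE_SECONDS:
            kept.append(c)                        # tombstone too young to drop
        # else: tombstone is past gc_grace -> purged, space reclaimed
    return kept

now = 1_000_000_000.0
cells = [
    Cell("a"),                                    # live
    Cell(None, deleted_at=now - 11 * 86400),      # deleted 11 days ago
    Cell(None, deleted_at=now - 86400),           # deleted yesterday
]
after = compact(cells, now)  # only the 11-day-old tombstone is purged
```

Until that compaction runs, reads still have to walk every tombstone in the range, which is the slowdown discussed in this thread.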

Frankly, this whole situation illustrates a very real deficiency 
in Cassandra -- one would think that
deleting less than one percent of the data shouldn't lead to complete 
failures in certain indexed queries.
That's bad.


On 11/14/2011 3:01 AM, Guy Incognito wrote:
> i think what he means is: do you know what day the 'oldest' day is?  
> eg if you have a rolling window of say 2 weeks, structure your query 
> so that your slice range only goes back 2 weeks, rather than to the 
> beginning of time.  this would avoid iterating over all the tombstones 
> from prior to the 2 week window.  this wouldn't work if you are 
> deleting arbitrary days in the middle of your date range.
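The rolling-window idea above can be sketched as follows (plain Python with a hypothetical 14-day window, not pycassa API; `slice_bounds` is a made-up helper name):

```python
from datetime import date, timedelta

WINDOW_DAYS = 14  # hypothetical retention window from the example above

def slice_bounds(today, window_days=WINDOW_DAYS):
    """Start the slice at the oldest day that can still hold live data,
    instead of at the beginning of time, so the read never walks the
    tombstones left by deletions before the window."""
    start = today - timedelta(days=window_days)
    return start.isoformat(), today.isoformat()

start, end = slice_bounds(date(2011, 11, 14))
# days before `start` are all tombstones and are simply never scanned
```

As Guy notes, this only helps when deletions roll off the old end of the range; it does nothing for arbitrary deletions in the middle.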
> On 14/11/2011 02:02, Maxim Potekhin wrote:
>> Thanks Peter,
>> I'm not sure I entirely follow. By the oldest data, do you mean the
>> primary key corresponding to the limit of the time horizon? 
>> Unfortunately,
>> unique IDs and the timestamps do not correlate in the sense that 
>> chronologically
>> "newer" entries might have a smaller sequential ID. That's because 
>> the timestamp
>> corresponds to the last update, which is stochastic in the sense that 
>> the jobs can take
>> from seconds to days to complete. As I said, I'm not sure I understood 
>> you
>> correctly.
>> Also, I note that queries on different dates (i.e. not "contaminated" 
>> with lots
>> of tombstones) work just fine, which is consistent with the picture that
>> emerged so far.
>> Theoretically -- would compaction or cleanup help?
>> Thanks
>> Maxim
>> On 11/13/2011 8:39 PM, Peter Schuller wrote:
>>>> I do limit the number of rows I'm asking for in Pycassa. Queries on 
>>>> primary
>>>> keys still work fine,
>>> Is it feasible in your situation to keep track of the oldest possible
>>> data (for example, if there is a single sequential writer that rotates
>>> old entries away it could keep a record of what the oldest might be)
>>> so that you can bound your index lookup to >= that value (and avoid the
>>> tombstones)?
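Peter's suggestion -- a single sequential writer that rotates old entries away and remembers the oldest key it still holds -- could look something like this (a toy in-memory sketch; `RotatingWriter` is a hypothetical name, not anything in Cassandra or pycassa):

```python
import collections

class RotatingWriter:
    """Single sequential writer that rotates old entries away and keeps
    a record of the oldest key still live, so index lookups can be
    bounded with >= oldest_key and skip the tombstones below it."""

    def __init__(self, capacity):
        self.entries = collections.OrderedDict()
        self.capacity = capacity

    def write(self, key, value):
        self.entries[key] = value
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # rotate the oldest entry away

    @property
    def oldest_key(self):
        # lower bound for index queries; keys below this are all tombstones
        return next(iter(self.entries), None)

w = RotatingWriter(capacity=3)
for day in ["2011-11-10", "2011-11-11", "2011-11-12", "2011-11-13"]:
    w.write(day, "jobs")
# after rotation, w.oldest_key marks where live data begins
```

The record of the oldest key is exactly the small piece of bookkeeping infrastructure Maxim was hoping to avoid.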
