cassandra-user mailing list archives

From Maxim Potekhin <potek...@bnl.gov>
Subject Re: Mass deletion -- slowing down
Date Mon, 14 Nov 2011 01:25:47 GMT
Brandon,

thanks for the note.

Each row represents a computational task (a job) executed on the grid or 
in the cloud. It naturally has a timestamp as one of its attributes, 
representing the time of the last update. This timestamp is used to
group the data into "buckets", each representing one day of the
system's activity.
I create the "DATE" attribute and add it to each row, e.g. it's a column 
{'DATE','20111113'}.
I create an index on that column, along with a few others.
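
A minimal sketch of how such a daily bucket label could be derived from a row's last-update time. The helper names, the use of epoch seconds, and the choice of UTC are assumptions made for illustration; the message only states that the column looks like {'DATE','20111113'}.

```python
from datetime import datetime, timezone


def day_bucket(last_update_epoch):
    """Return the 'YYYYMMDD' label for a last-update time in epoch seconds.

    Assumption: buckets are cut at UTC midnight.
    """
    moment = datetime.fromtimestamp(last_update_epoch, tz=timezone.utc)
    return moment.strftime('%Y%m%d')


def with_date_column(columns, last_update_epoch):
    """Return a copy of a row's columns with the 'DATE' bucket column added."""
    tagged = dict(columns)
    tagged['DATE'] = day_bucket(last_update_epoch)
    return tagged
```

The resulting dict is what would be written to the row, so that the index on 'DATE' can later select a whole day at once.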

Now I want to rotate the data out of my database on a daily basis. For
that, I need to select on 'DATE' and then do a delete.
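
The select-then-delete step might look roughly like the sketch below, which works through one day's bucket in small pages. It assumes `cf` is a pycassa ColumnFamily (or any object offering `get_indexed_slices` and `remove`); `purge_day`, `pycassa_clause`, `page_size` and `max_pages` are hypothetical names, and this is an illustration rather than the code actually in use.

```python
def pycassa_clause(day, count):
    """Build an index clause matching DATE == day, limited to `count` rows."""
    # Imported lazily so the paging logic below can be exercised without pycassa.
    from pycassa.index import create_index_clause, create_index_expression
    expression = create_index_expression('DATE', day)
    return create_index_clause([expression], count=count)


def purge_day(cf, clause_for, day, page_size=500, max_pages=10000):
    """Delete all rows whose DATE column equals `day`, one page at a time.

    Returns the number of rows removed. `max_pages` is a safety cap so the
    loop cannot run forever if the query keeps returning rows.
    """
    deleted = 0
    for _ in range(max_pages):
        clause = clause_for(day, page_size)
        # Fetch only the DATE column; rows that come back without columns
        # are treated as already deleted and skipped.
        rows = cf.get_indexed_slices(clause, columns=['DATE'])
        keys = [key for key, columns in rows if columns]
        if not keys:
            break
        for key in keys:
            cf.remove(key)
        deleted += len(keys)
    return deleted
```

Against a real cluster the call would be something like `purge_day(cf, pycassa_clause, '20111113')`. Keeping each page small bounds how much data a single indexed query asks for.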

I do limit the number of rows I'm asking for in Pycassa. Queries on
primary keys still work fine; it's just the indexed queries that start
to time out. I changed the timeouts and the number of retries in the
Pycassa pool, but that doesn't seem to help.
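
For reference, the pool settings in question can be sketched as follows. `timeout` and `max_retries` are keyword arguments of pycassa's ConnectionPool; the values, the helper names, and the keyspace and server shown in the usage note are placeholders, not the settings actually tried.

```python
def pool_options(timeout_s=5.0, max_retries=10):
    """Collect the client-side settings: socket timeout (seconds) and retry count."""
    return {'timeout': timeout_s, 'max_retries': max_retries}


def open_pool(keyspace, servers, **options):
    """Open a pycassa connection pool with the given client-side options."""
    # Imported lazily so pool_options() can be used without pycassa installed.
    import pycassa
    return pycassa.ConnectionPool(keyspace, server_list=servers, **options)
```

A call might look like `open_pool('MyKeyspace', ['localhost:9160'], **pool_options())`. These values only govern how long the client waits and how often it retries; the server-side rpc_timeout mentioned in the quoted reply below still applies to every request.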

Thanks,
Maxim

On 11/13/2011 8:00 PM, Brandon Williams wrote:
> On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin <potekhin@bnl.gov> wrote:
>> Thanks to all for valuable insight!
>>
>> Two comments:
>> a) this is not actually time series data, but yes, each item has
>> a timestamp and thus chronological attribution.
>>
>> b) so, what do you practically recommend? I need to delete
>> half a million to a million entries daily, then insert fresh data.
>> What's the right operation procedure?
> I'd have to know more about what your access pattern is like to give
> you a fully informed answer.
>
>> For some reason I can still select on the index in the CLI, it's
>> the Pycassa module that gives me trouble, but I need it as this
>> is my platform and we are a Python shop.
> This seems odd, since the rpc_timeout is the same for all clients.
> Maybe pycassa is asking for more data than the cli?
>
> -Brandon

