db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: Question/observations on purged rows and insert rates
Date Fri, 06 Aug 2010 17:55:09 GMT
Bergquist, Brett wrote:
> Thanks for the response Mike, next week I will put together a test program that I can
use to attach to a Jira issue on this.  
> This info on the PURGE_ROWS option is good.  It should probably go into the manual,
explaining why one would want to use it.  Now I understand better.  
> In my case, the table in question has mostly integer data types with no clobs/blobs.
 Rows are being added to the table in increasing time order and older rows (older than 30
days for example) are being deleted.
> Just for a little more background, for the test that I did, only this table is updated,
database and logging are being done on the same file system, multiple threads are inserting
rows into the table simultaneously.  After inserting a few million rows and observing the
insert rate, the insertion process was suspended, 1.5 million rows were removed by a single
thread, and the insertion process was restarted.  No shutdown of the database occurred.  Resuming
the insertion process showed a dramatic decrease in insertion speed.
> A shutdown of the database was done, a compact was performed using the embedded driver
to access the database, and the above test was performed.  After the compact, the speed returned
to normal.
> I agree with your approach on using separate tables.  More research led me to find that
this "partitioning" is supported natively in some databases.  I prefer not to move to another
database as so far over the last 4 or 5 years, Derby has worked well for us in a "zero administration"
kind of way.  I would hate to give this up.  I probably will look into doing this "partitioning"
via procedure and convention.
> Thanks for your help.
> Brett
great, a repro is always best.  I would like to see derby handle this 
kind of application.  I am pretty sure the default space reclamation
is built to handle this kind of load well, at least on the space reuse
side (it has more problems when apps pick and choose rows to delete
and those deletes are not followed by subsequent inserts).  The problem
may just be that deletes cause work for the db (cpu, log I/O, and
database I/O) and this work is affecting the immediately subsequent
inserts.

Once you have the repro, the first thing to look at will be whether the
async background work of the deletes is affecting the immediately
following inserts.  So you should see if the insert rate goes back up
after waiting some amount of time after the delete.

I understand that this is not going to be useful for your application, but 
it may shed more light on what is going on.  Could you run your existing
test past the point of doing the delete and then wait some amount of
time before restarting the inserts?  I am not sure exactly how long, but
I would guess an hour would be way more than enough.  I just want to 
understand whether the issue is the background handling of the deletes 
affecting the inserts, or whether the empty space in the db really causes
the issue.

You could also just actively monitor the process and see when the
cpu and I/O on the system go dormant after the delete.  It is 
definitely the case that after the massive delete there will be work
that has been queued to the background, to be completed after the user
executes the commit for the transaction that does the deletes.  In general
this work can take as much time as the actual delete (maybe even more,
as the initial delete just sets a bit to mark the rows deleted, so not
much logging - whereas the space reclamation can do a lot of I/O and cpu
to move rows around and mark the pages free).
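
For concreteness, a repro skeleton along these lines would be enough to start from.  All table, index, and column names below are invented for illustration, not taken from the actual application schema; the delete uses Derby's JDBC escape syntax for timestamp arithmetic:

```sql
-- Hypothetical repro skeleton; names are illustrative only.
CREATE TABLE event_log (
    id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    recorded  TIMESTAMP NOT NULL,
    sensor_id INTEGER NOT NULL,
    reading   INTEGER NOT NULL
);
CREATE INDEX event_log_recorded ON event_log (recorded);

-- Multiple threads insert rows in increasing time order; a single
-- thread then purges everything older than 30 days:
DELETE FROM event_log
WHERE recorded < {fn TIMESTAMPADD(SQL_TSI_DAY, -30, CURRENT_TIMESTAMP)};
```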
> -----Original Message-----
> From: Mike Matrigali [mailto:mikem_app@sbcglobal.net] 
> Sent: Friday, August 06, 2010 1:00 PM
> To: derby-dev@db.apache.org
> Subject: Re: Question/observations on purged rows and insert rates
> Bergquist, Brett wrote:
>> I posted the following to the users list but saw no response, so I 
>> thought I would post here as it may be more relevant to developers.
>> After some testing, it seems that Derby is reusing the space of deleted 
>> records before allocating more space from the file system.  If this is 
>> the case then what use does the call:
>>                 call 
>> syscs_util.syscs_inplace_compress_table(SCHEMA,TABLE, 1, 0, 0);
>> have?  Basically what does the PURGE_ROWS option do that is above and 
>> beyond what is being done by Derby already to reuse the space of deleted 
>> records?
> If you have no clobs/blobs in your table then derby by default will only 
> reclaim space once it thinks it can get a whole page.  So it waits
> until the LAST row on a page has been deleted before attempting to
> reclaim the space and make a full page available for subsequent use.
> The purge option of compress instead tries to reclaim the space of every
> single deleted row in the table.
>  From what you have written it seems like most of your rows are likely
> to be in order in the file and you tend to delete them in chunks that
> are in order, so I would assume that the default space reclamation would
> work well in your case.
>> Also after testing we are seeing the following. With a database with no 
>> deleted rows, my test application is inserting about 150 records/second 
>> into a table.  I let this run for about 2 million records and the insert 
>> rate is consistent.  Now I purge out 1.5 million records and run the 
>> test again.   
> Does the above sentence mean you sql delete the first 1.5 million records?
>> The insert rate is now about 25 records/second.  Running
>> the above compress with the PURGE_ROWS option and rerun the test and 
>> still about 25 records/second.  Run full SYSCS_UTIL.SYSCS_COMPRESS_TABLE 
>> and rerun the test and the insert rate is back to 150 records/second.
> I am not sure what is going on.  The best thing you could do if you want
> more help is to post a repro with as much detail as possible; this just
> answers a lot of questions about what you are doing rather than having
> to guess (where shutdowns happen, page size, blobs/clobs?, log on a
> separate disk or not, where commits happen, inserts single threaded, 
> deletes single threaded, ...).  We don't need your real data, but table 
> and index ddl that match your app would be best - with a program that 
> generates dummy data for inserts that reproduces the issue.  Go ahead 
> and log a performance JIRA issue
> if you can reproduce this in a simple java program.
>> The reduced insert rate because of deleted records is very much a 
>> problem.  We have a table that gets about 700K records inserted per day 
>> and purges out 30 days old data at about 700K records per day.  This has 
>> a great effect on our insert rate.  Why the big hit because of deleted 
>> records and can anything other than a compress help?  This process has 
>> no downtime so running a compress can only be done maybe once a month.
> I will just note that space reclamation of deleted records does generate
> "background" work.  And this background work can affect concurrent
> insert performance.
> If your application can handle the complications with using multiple 
> tables, I would suggest you go that way.  That is the approach I have
> seen most applications like this use across a wide range of databases.
> Inserts then are going to be best-case performance, and reclaiming the
> space can be done "online" without affecting the data you might need
> to access.  Dropping an entire table is always going to be less overhead
> on the system than deleting the same number of rows from a table, as in
> the latter case each row will be logged, and a number of cpu actions are
> taken for manipulating the bytes on the pages for each delete.  In the
> case of a drop we simply log the drop table and delete the file from
> the file system.
> You could use table functions or views to make it easier to query the 
> "current" set of data.
>> Thanks for any feedback.
>> Brett
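
As a postscript for the archive, the "multiple tables" approach suggested above can be sketched in plain Derby SQL roughly as follows.  All table and view names are invented; Derby has no native partitioning, so the rotation would be driven by the application or a scheduled procedure:

```sql
-- Sketch only: one table per month, names invented for illustration.
CREATE TABLE event_2010_07 (recorded TIMESTAMP NOT NULL, reading INTEGER NOT NULL);
CREATE TABLE event_2010_08 (recorded TIMESTAMP NOT NULL, reading INTEGER NOT NULL);

-- A view presents the current window as a single table for queries.
CREATE VIEW current_events AS
    SELECT recorded, reading FROM event_2010_07
    UNION ALL
    SELECT recorded, reading FROM event_2010_08;

-- Retiring a month becomes a cheap DROP instead of a mass DELETE.
-- Derby will not drop a table a view depends on, so rebuild the view first.
DROP VIEW current_events;
CREATE VIEW current_events AS
    SELECT recorded, reading FROM event_2010_08;
DROP TABLE event_2010_07;
```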
