db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: Discussion of incremental checkpointing----Added some new content
Date Mon, 13 Feb 2006 19:55:55 GMT


Øystein Grøvlen wrote:
> I would like to see the changes to checkpointing that Raymond suggests. 
> The main reason I like this is that it provides separation of 
> concerns.  It cleanly separates the work to reduce recovery time 
> (checkpointing) from the work to make sure that a sufficient part of the 
> cache is clean (background writer).  I think what Raymond suggests is 
> similar to the way ARIES proposes to do checkpointing.  As far as I 
> recall, ARIES goes a step further since checkpointing does not involve 
> any writing of pages at all.  It just updates the control file based on 
> the oldest dirty page.
> 
> 
> Mike Matrigali wrote:
> 
>> I think my main issue is that I don't see that it is important to
>> optimize writing the cached dirty data.  Especially since the order
>> in which you propose writing the dirty data is exactly the wrong
>> order for the current cache performance goal of minimizing the total
>> number of I/Os the system is going to do (the oldest dirty page in
>> a busy cache is most likely there because it has been written many
>> times - otherwise the standard background I/O thread would have
>> written it already).
> 
> 
> I think your logic is flawed if you are talking about checkpointing (and 
> not the background writer).  If you want to guarantee a certain recovery 
> time, you will need to write the oldest page.  Otherwise, you will not 
> be able to advance the starting point for recovery.  This approach to 
> checkpointing should reduce the number of I/Os since you are not writing 
> a busy page until it is absolutely necessary.  The current checkpointing 
> writes a lot of pages that do nothing to make it possible to 
> garbage-collect the log.  Those pages should be left to the background 
> writer, which can use its own criteria for which pages are optimal to 
> write.

I guess I was not clear; I agree with you:
     checkpoint - wants to write the oldest page; I agree this is
necessary to move the redo low water mark.
     background - wants to write the least-used page, which is probably
not the oldest page.

Which pages are you saying the current checkpoint process writes that
are not necessary?  Are they the ones that go from clean to dirty after
the checkpoint starts?  It seems that in the current checkpoint, all
pages dirty at the start are necessary to move the redo low water mark.
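To make the low-water-mark point concrete, here is a rough sketch
(hypothetical names, not actual Derby code) of a checkpoint that writes
pages in first-dirtied order; recovery then only has to start from the
first-dirtied log instant of the oldest page that is still dirty:

import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch only - not actual Derby code.
final class CheckpointSketch {

    static final class DirtyPage {
        final long pageNumber;
        final long firstDirtiedInstant;   // log instant that first dirtied the page
        DirtyPage(long pageNumber, long firstDirtiedInstant) {
            this.pageNumber = pageNumber;
            this.firstDirtiedInstant = firstDirtiedInstant;
        }
    }

    interface PageWriter { void writeAndSync(DirtyPage p); }

    // Pages ordered by the instant they were first dirtied (oldest at the head).
    private final Deque<DirtyPage> dirtyQueue = new ArrayDeque<>();

    // Write every page that was already dirty when the checkpoint started.
    // The redo low water mark can then advance to the first-dirtied instant
    // of the oldest page still dirty (or to the checkpoint start instant
    // if nothing is left).
    long checkpoint(long checkpointStartInstant, PageWriter writer) {
        while (!dirtyQueue.isEmpty()
               && dirtyQueue.peekFirst().firstDirtiedInstant < checkpointStartInstant) {
            writer.writeAndSync(dirtyQueue.removeFirst());
        }
        return dirtyQueue.isEmpty()
                ? checkpointStartInstant
                : dirtyQueue.peekFirst().firstDirtiedInstant;
    }
}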
> 
> Raymond suggests using his queue for the background writer as well.  This 
> is NOT a good idea!  The background writer should write those pages that 
> are least likely to be accessed in the near future since they are the 
> best candidates to be replaced in the cache.  Currently a clock 
> algorithm is used for this.  I am not convinced that is the best 
> approach.  I suspect that an LRU-based algorithm would be much better. 
> (But that is a separate discussion.)

I agree.  The background writer has a different job.  And it could be
optimized, but that is a separate discussion.

> 
>> If we knew Derby was the only process on the machine then an approach
>> such as you suggest might be reasonable, i.e. we own all the resources
>> so we should max out the use of all those resources.  But as a zero-admin
>> embedded db I think Derby should be more conservative in its
>> resource usage.
> 
> 
> I agree, and I think that an incremental approach makes that easier. You 
> are more free to pause the writing activity without significantly 
> impacting recovery time.  With the current checkpointing, slowing down 
> the writing will more directly increase the recovery time.
> 
> If we have determined that 20 MB of log will give a decent recovery 
> time, we can write the pages at the head of the queue at a rate that 
> tries to keep the amount of active log around 20 MB.  This should spread 
> checkpointing I/O more evenly over time instead of the bursty behavior 
> we have today.
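Roughly, the pacing described above might look like the following sketch
(hypothetical interfaces, and it assumes log instants can be compared as
byte positions - not actual Derby code):

// Hypothetical sketch - not actual Derby code.
final class IncrementalCheckpointer {

    interface LogView {
        long endPosition();                        // current end of log, in bytes
        void setRedoLowWaterMark(long position);   // older log can be garbage-collected
    }
    interface DirtyPageQueue {
        boolean isEmpty();
        long oldestFirstDirtiedPosition();         // head of the queue
        void writeAndRemoveOldest();               // write + sync that page
    }

    private static final long TARGET_ACTIVE_LOG_BYTES = 20L * 1024 * 1024;  // ~20 MB

    private final LogView log;
    private final DirtyPageQueue queue;

    IncrementalCheckpointer(LogView log, DirtyPageQueue queue) {
        this.log = log;
        this.queue = queue;
    }

    // Called periodically: write oldest pages only while the log that
    // recovery would have to redo exceeds the target, so the I/O is
    // spread out over time instead of happening in one burst.
    void maybeWriteSome() {
        while (!queue.isEmpty()
               && log.endPosition() - queue.oldestFirstDirtiedPosition()
                      > TARGET_ACTIVE_LOG_BYTES) {
            queue.writeAndRemoveOldest();
        }
        log.setRedoLowWaterMark(queue.isEmpty()
                ? log.endPosition()
                : queue.oldestFirstDirtiedPosition());
    }
}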
> 
>>
>> I agree that your incremental approach optimizes recovery time; I
>> just don't think any runtime performance hit is worth it (or even the
>> extra complication of the checkpoint algorithms, if there is no runtime
>> cost).  The system should, as you propose, attempt to guarantee a maximum
>> recovery time - but I see no need to work hard (i.e. use extra
>> resources) to guarantee better than that.  Recovery is an edge case;
>> it should not be optimized for.
> 
> 
> I agree, and that is why I think that we should write as few pages as 
> possible with recovery time in mind (i.e., during checkpointing).  In 
> other words, we should only write pages that will actually advance the 
> starting point for recovery.
> 
>>
>> Also note that the current checkpoint does 2 operations to ensure
>> each page is on disk; you cannot assume the page has hit disk
>> until both are complete.  It first uses a Java write (which is async
>> by default), and then it forces the entire file.  The second step
>> is a big overhead on some systems, so it is not appropriate to do
>> for each write (the overhead is CPU linear in the size of the file
>> rather than the number of dirty pages).
> 
> 
> I think we SHOULD sync for every I/O, but not the way we do today.  By 
> opening the files with "rwd", we should be able to do this pretty 
> efficiently already today.  (At least on some systems.  I am not sure 
> about non-POSIX systems like Windows.)  Syncing for every I/O gives us 
> much more control over the I/O, and we will not be vulnerable to queuing 
> effects that we do not control.

Do you think we should sync for every I/O in the non-checkpoint case
also?  The case I am most interested in is where a user transaction
needs to wait for a page in the cache, and the only way to provide that
page is by writing out another page in the cache.  Currently this write
is async; are you proposing to change it to a sync write?
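For reference, the two styles being contrasted could be sketched like
this in plain Java (illustrative only; the file path and offsets are
made up - RandomAccessFile's "rwd" mode is the mechanism referred to
above):

import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative only - not Derby's cache code.
final class WriteStyles {

    // Current style: an async-ish write now, then one force of the whole
    // file later (at checkpoint).  The page is not guaranteed on disk
    // until the force completes.
    static void writeThenForceLater(RandomAccessFile file, long offset, byte[] page)
            throws IOException {
        file.seek(offset);
        file.write(page);                 // may sit in the OS cache for a while
        // ... later, during the checkpoint ...
        file.getChannel().force(false);   // flushes all dirty pages of the file
    }

    // The style suggested above: open the file with "rwd" so each write
    // reaches the device before write() returns; no separate force() needed.
    static void syncedWrite(String path, long offset, byte[] page) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rwd")) {
            file.seek(offset);
            file.write(page);             // returns only after the data is on disk
        }
    }
}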

> 
>> This has been discussed
>> previously on the list.  As has been pointed out, the most efficient
>> method of writing out a number of pages is to somehow queue a small
>> number of writes async, and then wait for all to finish before
>> queueing the next set.  Unfortunately standard OS mechanisms to do
>> this don't exist yet in Java; they are being proposed in some new
>> JSRs.  I have been waiting for patches from others, but if one doesn't
>> come I will change the current checkpoint before the next release to
>> queue a small number of writes, wait for the estimated time it takes
>> to execute those writes, and then continue to queue more writes.  This
>> should solve 90% of the checkpoint I/O flood issue.
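A minimal sketch of that fallback (the batch size and per-write cost
estimate are made-up numbers, not Derby's):

import java.util.List;

// Hypothetical sketch of the throttled checkpoint writer described above.
final class ThrottledCheckpointWriter {

    interface PageWriter {
        void write(byte[] page);   // plain async-style write, no sync
        void syncFile();           // one force at the end, as today
    }

    private static final int BATCH_SIZE = 8;
    private static final long ESTIMATED_MILLIS_PER_WRITE = 5;   // assumed I/O cost

    void writeAll(List<byte[]> dirtyPages, PageWriter writer)
            throws InterruptedException {
        for (int i = 0; i < dirtyPages.size(); i += BATCH_SIZE) {
            int end = Math.min(i + BATCH_SIZE, dirtyPages.size());
            for (int j = i; j < end; j++) {
                writer.write(dirtyPages.get(j));
            }
            // Pause roughly as long as the batch should take to reach disk,
            // so the checkpoint never floods the I/O system.
            Thread.sleep((end - i) * ESTIMATED_MILLIS_PER_WRITE);
        }
        writer.syncFile();
    }
}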
> 
> 
> I have been planning to address this for a while, but have not been able 
> to do that so far.  I was planning to experiment a bit with the syncing 
> I describe above to see if there are scenarios where such an approach 
> would not give sufficient throughput.  If that is the case, we would 
> need to parallelize the writing.  If I do not have time to do that, I 
> would go for something simpler, as you describe above.
> 
> -- 
> Øystein
> 
> 

