db-derby-dev mailing list archives

From Suresh Thalamati <suresh.thalam...@gmail.com>
Subject Re: Discussion of incremental checkpointing----Added some new content
Date Mon, 13 Feb 2006 20:13:10 GMT
I agree, it would be nice to have checkpoints (light checkpoints) that
do not need to flush the whole cache to disk, especially when the
cache is configured to be very large. I also remember reading about
this somewhere a long time ago. I think the basic idea is to keep track
of the highest LSN among the pages that have already been flushed to
disk, and at checkpoint flush only the pages still in the cache whose
LSN is lower than that. This could be achieved by keeping the first LSN
that updated the page, in addition to the last LSN that is currently
used to flush the log before the page is written to disk. The main
difference between the current checkpoint and this one is that the
REDO Low Water Mark can be long before the checkpoint log record. In
the worst case it will be the same as the current checkpoint.
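
A rough sketch of that bookkeeping (class and field names here are
hypothetical, not Derby's actual cache code): each cached page remembers
the first LSN that dirtied it in addition to the last LSN used for
write-ahead logging, and a light checkpoint flushes only the pages whose
first-dirty LSN falls below a cutoff; the REDO Low Water Mark is then the
smallest first-dirty LSN still left in the cache.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch; Derby's real page cache and log instant types differ.
    class LightCheckpointSketch {
        static class Page {
            long firstDirtyLsn = -1; // LSN of the first update since the page was last written
            long lastLsn = -1;       // LSN of the latest update; the log must be flushed to here
                                     // before the page itself may be written (write-ahead logging)
            boolean dirty;
        }

        final ConcurrentHashMap<Long, Page> cache = new ConcurrentHashMap<>();

        // Called when a log record with 'lsn' updates page 'pageId'.
        void pageUpdated(long pageId, long lsn) {
            Page p = cache.computeIfAbsent(pageId, id -> new Page());
            if (!p.dirty) {
                p.dirty = true;
                p.firstDirtyLsn = lsn; // remember when the page first became dirty
            }
            p.lastLsn = lsn;
        }

        // Light checkpoint: flush only pages first dirtied before 'cutoffLsn' and
        // return the new REDO Low Water Mark (smallest firstDirtyLsn left in the cache).
        long lightCheckpoint(long cutoffLsn) {
            List<Page> toFlush = new ArrayList<>();
            long redoLowWaterMark = Long.MAX_VALUE;
            for (Page p : cache.values()) {
                if (!p.dirty) continue;
                if (p.firstDirtyLsn < cutoffLsn) {
                    toFlush.add(p);
                } else {
                    redoLowWaterMark = Math.min(redoLowWaterMark, p.firstDirtyLsn);
                }
            }
            for (Page p : toFlush) {
                // flush log up to p.lastLsn, then write the page to disk (omitted) ...
                p.dirty = false;
                p.firstDirtyLsn = -1;
            }
            return redoLowWaterMark == Long.MAX_VALUE ? cutoffLsn : redoLowWaterMark;
        }
    }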

Changing the page writes to synchronous using "rwd" is a good idea when
the cache is large. With small cache sizes, like the default of 1000
pages, it might be a problem because a user request for an empty page
is likely to trigger foreground synchronous writes.
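
For reference, a minimal sketch of what that open mode looks like (the
file name and page size are placeholders, not Derby's container layout):
with "rwd", each write returns only after the data has reached the
device, so no separate sync of the whole file is needed, but a foreground
page write pays the full device latency.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch only; the file name and page size are placeholders, not Derby's layout.
    public class RwdPageWrite {
        public static void main(String[] args) throws IOException {
            byte[] page = new byte[4096];
            // "rwd" makes each write synchronous: write() returns only after the page
            // content has reached the device, so no separate sync of the whole file is
            // needed.  With a small cache, evicting a dirty page to satisfy a request
            // for a free page pays this latency in the foreground.
            try (RandomAccessFile container = new RandomAccessFile("container.dat", "rwd")) {
                container.seek(3 * 4096L); // offset of the page within the container
                container.write(page);
            }
        }
    }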

Thanks
-suresht

Øystein Grøvlen wrote:
> I would like to see the changes to checkpointing that Raymond suggests. 
>   The main reason I like this is that it provides separation of 
> concerns.  It cleanly separates the work to reduce recovery time 
> (checkpointing) from the work to make sure that a sufficient part of the 
> cache is clean (background writer).  I think what Raymond suggests is 
> similar to the way ARIES proposes to do checkpointing.  As far as I 
> recall, ARIES goes a step further since checkpointing does not involve 
> any writing of pages at all.  It just updates the control file based on 
> the oldest dirty page.
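
Roughly, such a checkpoint writes no pages at all and only records the
oldest recovery LSN among the dirty pages; a sketch with invented names
(neither ARIES' nor Derby's actual structures):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of a write-free ("fuzzy") checkpoint in the spirit of ARIES; names are invented.
    class FuzzyCheckpointSketch {
        // pageId -> LSN of the first log record that dirtied the page (recLSN).
        final Map<Long, Long> dirtyPageTable = new ConcurrentHashMap<>();

        long takeCheckpoint(long checkpointLsn) {
            // No page is written; the redo starting point is simply the oldest recLSN,
            // or the checkpoint's own LSN if nothing is dirty at the moment.
            long redoStart = dirtyPageTable.values().stream()
                    .mapToLong(Long::longValue)
                    .min()
                    .orElse(checkpointLsn);
            // A real implementation would also log the dirty page table and update
            // the control file with redoStart here.
            return redoStart;
        }
    }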
> 
> 
> Mike Matrigali wrote:
> 
>> I think my main issue is that I don't see that it is important to
>> optimize writing the cached dirty data.  Especially since the order
>> that you are proposing writing the dirty data is exactly the wrong
>> order for the current cache performance goal of minimizing the total
>> number of I/Os the system is going to do (a page that is the oldest
>> written exists in a busy cache most likely because it has been written
>> many times - otherwise the standard background I/O thread would have
>> written it already).
> 
> 
> I think your logic is flawed if you are talking about checkpointing (and 
> not the background writer).  If you want to guarantee a certain recovery 
> time, you will need to write the oldest page.  Otherwise, you will not 
> be able to advance the starting point for recovery.  This approach to 
> checkpointing should reduce the number of I/Os since you are not writing 
> a busy page until it is absolutely necessary.  The current checkpointing 
> writes a lot of pages that do nothing to make it possible to 
> garbage-collect the log.  Those pages should be left to the background 
> writer, which can use its own criteria for which pages are optimal to 
> write.
> 
> Raymond suggests using his queue for the background writer as well.  This 
> is NOT a good idea!  The background writer should write those pages that 
> are least likely to be accessed in the near future since they are the 
> best candidates to be replaced in the cache.  Currently a clock 
> algorithm is used for this.  I am not convinced that is the best 
> approach.  I suspect that an LRU-based algorithm would be much better. 
> (But this is a separate discussion.)
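
For illustration only (this is not Derby's clock code), an access-ordered
java.util.LinkedHashMap gives LRU ordering almost for free, making the
least recently used page the natural replacement candidate:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical LRU-ordered page table; Derby's cache currently uses a clock algorithm.
    class LruPageTable<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruPageTable(int capacity) {
            super(16, 0.75f, true); // accessOrder = true: iteration goes from least to most recently used
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            // The eldest entry is the least recently accessed page; a background writer
            // would want to have cleaned it (if dirty) before it is evicted here.
            return size() > capacity;
        }
    }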
> 
>> If we knew derby was the only process on the machine then an approach
>> as you suggest might be reasonable, i.e. we own all resources so we
>> should max out the use of all those resources.  But as a zero-admin
>> embedded db I think derby should be more conservative in its
>> resource usage.
> 
> 
> I agree, and I think that an incremental approach makes that easier. You 
> are more free to pause the writing activity without significantly 
> impacting recovery time.  With the current checkpointing, slowing down 
> the writing will more directly increase the recovery time.
> 
> If we have determined that 20 MB of log will give a decent recovery 
> time, we can write the pages at the head of the queue at a rate that 
> tries to keep the amount of active log around 20 MB.  This should spread 
> checkpointing I/O more evenly over time instead of the bursty behavior 
> we have today.
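
A rough sketch of that pacing policy, assuming LSNs can be treated as
byte offsets into the log and using made-up names and thresholds:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical pacing sketch; LSNs are treated as byte offsets into the log,
    // and the numbers are illustrative, not Derby's.
    class CheckpointPacer {
        static final long TARGET_ACTIVE_LOG = 20L * 1024 * 1024; // ~20 MB, as in the example above

        static class DirtyPage {
            final long firstDirtyLsn; // LSN that first dirtied the page
            DirtyPage(long firstDirtyLsn) { this.firstDirtyLsn = firstDirtyLsn; }
        }

        // Dirty pages ordered oldest first (head = smallest firstDirtyLsn).
        final Deque<DirtyPage> dirtyQueue = new ArrayDeque<>();

        // Called periodically by a background thread with the current end-of-log LSN.
        void pace(long endOfLogLsn) {
            while (!dirtyQueue.isEmpty()
                    && endOfLogLsn - dirtyQueue.peekFirst().firstDirtyLsn > TARGET_ACTIVE_LOG) {
                // Cleaning the oldest dirty page advances the redo low water mark,
                // so the amount of log that must be kept shrinks back toward the target.
                writeAndSync(dirtyQueue.pollFirst());
            }
        }

        private void writeAndSync(DirtyPage p) {
            // Flush the log up to the page's last LSN, write the page, sync (omitted).
        }
    }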
> 
>>
>> I agree that your incremental approach optimizes recovery time, I
>> just don't think that any runtime performance hit is worth it (or even
>> the extra complication of the checkpoint algorithms at no runtime cost).  The
>> system should, as you propose, attempt to guarantee a maximum
>> recovery time - but I see no need to work hard (i.e. use extra
>> resources) to guarantee better than that.  Recovery is an edge case;
>> it should not be optimized for.
> 
> 
> I agree, and that is why I think that we should write as few pages as 
> possible with recovery time in mind (i.e., during checkpointing).  In 
> other words, we should only write pages that will actually advance the 
> starting point for recovery.
> 
>>
>> Also note that the current checkpoint does 2 operations to ensure
>> each page is on disk; you can not assume the page has hit disk
>> until both are complete.  It first uses a java write (which is async
>> by default), and then it forces the entire file.  The second step
>> is a big overhead on some systems, so it is not appropriate to do
>> for each write (the overhead is CPU linear in the size of the file
>> rather than in the number of dirty pages).
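
For clarity, a sketch of that two-step pattern (the file name is a
placeholder): the writes themselves give no durability guarantee; only
the force of the whole file does, and its cost depends on the file size
rather than on how many pages were just written.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch only; the file name and page size are placeholders.
    public class WriteThenForceSketch {
        public static void main(String[] args) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile("container.dat", "rw");
                 FileChannel channel = raf.getChannel()) {
                ByteBuffer page = ByteBuffer.allocate(4096);
                // Step 1: write dirty pages; with "rw" these may only reach the OS cache.
                channel.write(page, 0 * 4096L);
                page.rewind();
                channel.write(page, 7 * 4096L);
                // Step 2: force the entire file; only after this returns are the pages
                // guaranteed to be on disk.  Its cost grows with the file, not with the
                // number of pages just written.
                channel.force(false);
            }
        }
    }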
> 
> 
> I think we SHOULD sync for every I/O, but not the way we do today.  By 
> opening the files with "rwd", we should be able to do this pretty 
> efficiently already today.  (At least on some systems.  I am not sure 
> about non-POSIX systems like Windows.)  Syncing for every I/O gives us 
> much more control over the I/O, and we will not be vulnerable to queuing 
> effects that we do not control.
> 
>>  This has been discussed
>> previously on the list.  As has been pointed out, the most efficient
>> method of writing out a number of pages is to somehow queue a small
>> number of writes async, and then wait for all to finish before
>> queueing the next set.  Unfortunately standard OS mechanisms to do
>> this don't exist yet in Java; they are being proposed in some new
>> JSRs.  I have been waiting for patches from others, but if one doesn't
>> come I will change the current checkpoint before the next release to
>> queue a small number of writes, wait for the estimated time of
>> executing those writes, and then continue to queue more writes.  This
>> should solve 90% of the checkpoint I/O flood issue.
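
A rough sketch of that interim approach, with an invented batch size and
latency estimate:

    import java.io.IOException;
    import java.util.List;

    // Hypothetical throttling sketch; the batch size and latency estimate are illustrative.
    class ThrottledCheckpointWriter {
        static final int BATCH_SIZE = 8;                  // small number of writes per burst
        static final long ESTIMATED_MILLIS_PER_WRITE = 5; // rough guess at device latency

        interface PageWriter { void write(long pageId) throws IOException; }

        void checkpoint(List<Long> dirtyPages, PageWriter writer)
                throws IOException, InterruptedException {
            for (int i = 0; i < dirtyPages.size(); i += BATCH_SIZE) {
                int end = Math.min(i + BATCH_SIZE, dirtyPages.size());
                for (int j = i; j < end; j++) {
                    writer.write(dirtyPages.get(j)); // queue a small batch of writes
                }
                // Wait roughly as long as the batch should take so the checkpoint does
                // not flood the I/O system and starve user transactions.
                Thread.sleep((end - i) * ESTIMATED_MILLIS_PER_WRITE);
            }
            // A final sync of the written files would follow here (omitted).
        }
    }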
> 
> 
> I have been planning to address this for a while, but have not been able 
> to do that so far.  I was planning to experiment a bit with the syncing 
> I describe above to see if there are scenarios were such an approach 
> would not give sufficient throughput.  If that is the case, we would 
> need to parallelize the writing.  If I do not have time to do that, I 
> would go for something simpler as you describe above.
> 
> -- 
> Øystein
> 

