db-derby-dev mailing list archives

From Øystein Grøvlen <Oystein.Grov...@Sun.COM>
Subject Re: Discussion of incremental checkpointing----Added some new content
Date Sat, 11 Feb 2006 19:37:18 GMT
I would like to see the changes to checkpointing that Raymond suggests.
The main reason I like this is that it provides separation of concerns:
it cleanly separates the work to reduce recovery time (checkpointing)
from the work to make sure that a sufficient part of the cache is clean
(background writer).  I think what Raymond suggests is similar to the
way ARIES proposes to do checkpointing.  As far as I recall, ARIES goes
a step further since checkpointing does not involve any writing of pages
at all; it just updates the control file based on the oldest dirty page.
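
To illustrate what I mean, here is a rough sketch of an ARIES-style
checkpoint.  None of these names exist in Derby; they are made up for
the example:

    import java.util.Map;

    // Sketch only, not Derby code: an ARIES-style checkpoint writes no
    // pages.  It records the oldest "recLSN" (the LSN that first dirtied
    // a page still in the cache) in the control file; recovery starts
    // redo there, and older log can be reclaimed.
    final class FuzzyCheckpointSketch {
        /** dirtyPages maps page number to the LSN that first dirtied it. */
        static long redoStartLsn(Map<Long, Long> dirtyPages, long endOfLogLsn) {
            long start = endOfLogLsn;            // nothing dirty: redo starts at log end
            for (long recLsn : dirtyPages.values()) {
                start = Math.min(start, recLsn); // oldest dirty page bounds redo
            }
            return start;                        // goes to the control file; no page I/O
        }
    }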


Mike Matrigali wrote:
> I think my main issue is that I don't see that it is important to
> optimize writing the cached dirty data.  Especially since the order
> in which you are proposing to write the dirty data is exactly the
> wrong order for the current cache performance goal of minimizing the
> total number of I/Os the system is going to do (the oldest-written
> page exists in a busy cache most likely because it has been written
> many times - otherwise the standard background I/O thread would have
> written it already).

I think your logic is flawed if you are talking about checkpointing (and
not the background writer).  If you want to guarantee a certain recovery
time, you will need to write the oldest page.  Otherwise, you will not
be able to advance the starting point for recovery.  This approach to
checkpointing should reduce the number of I/Os since you are not writing
a busy page until it is absolutely necessary.  The current checkpointing
writes a lot of pages that do nothing to make it possible to
garbage-collect the log.  Those pages should be left to the background
writer, which can use its own criteria for which pages are optimal to
write.

Raymond suggests using his queue for the background writer as well.  This
is NOT a good idea!  The background writer should write those pages that
are least likely to be accessed in the near future, since they are the
best candidates to be replaced in the cache.  Currently a clock
algorithm is used for this.  I am not convinced that is the best
approach; I suspect that an LRU-based algorithm would be much better.
(But that is a separate discussion.)
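
Just to show the idea, not as a proposal for the cache implementation:
an access-ordered java.util.LinkedHashMap is one simple way to get LRU
ordering.  The class below is an illustration only:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch only: an access-ordered LinkedHashMap keeps entries in
    // least-recently-used order, so the eldest entry is both the best
    // replacement victim and the kind of page the background writer
    // should clean first.
    final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCacheSketch(int capacity) {
            super(capacity, 0.75f, true);   // true = order by access, not insertion
            this.capacity = capacity;
        }

        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;       // evict the least recently used entry
        }
    }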

> If we knew derby was the only process on the machine then an approach
> as you suggest might be reasonable, i.e. we own all resources so we
> should max out the use of all those resources.  But as a zero-admin
> embedded db I think derby should be more conservative in its
> resource usage.

I agree, and I think that an incremental approach makes that easier. 
You are more free to pause the writing activity without significantly 
impacting recovery time.  With the current checkpointing, slowing down 
the writing will more directly increase the recovery time.

If we have determined that 20 MB of log will give a decent recovery 
time, we can write the pages at the head of the queue at a rate that 
tries to keep the amount of active log around 20 MB.  This should spread 
checkpointing I/O more evenly over time instead of the bursty behavior 
we have today.
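
Roughly, the pacing I have in mind would look like the sketch below.  It
assumes a queue of dirty pages ordered by the LSN that first dirtied
them; the names (firstDirtyLsn, endOfLogLsn, writePage) are made up:

    import java.util.Queue;

    // Sketch only, with made-up names: pace checkpoint writes so that the
    // amount of log between the oldest dirty page and the end of the log
    // stays near the target (e.g. 20 MB), instead of writing in a burst.
    abstract class CheckpointPacerSketch {
        static final class DirtyPage {
            final long firstDirtyLsn;               // LSN that first dirtied this page
            DirtyPage(long firstDirtyLsn) { this.firstDirtyLsn = firstDirtyLsn; }
        }

        abstract long endOfLogLsn();                // current end of the log
        abstract void writePage(DirtyPage p);       // write and sync one page

        void pace(Queue<DirtyPage> queue, long targetActiveLog, long pauseMillis)
                throws InterruptedException {
            while (!queue.isEmpty()
                    && endOfLogLsn() - queue.peek().firstDirtyLsn > targetActiveLog) {
                writePage(queue.poll());            // advances the recovery start point
                Thread.sleep(pauseMillis);          // spread the I/O over time
            }
        }
    }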

> 
> I agree that your incremental approach optimizes recovery time, I
> just don't think that any runtime performance hit is worth it (or even
> the extra complication of the checkpoint algorithms at no runtime
> cost).  The system should, as you propose, attempt to guarantee a
> maximum recovery time - but I see no need to work hard (i.e. use extra
> resources) to guarantee better than that.  Recovery is an edge case;
> it should not be optimized for.

I agree, and that is why I think that we should write as few pages as 
possible with recovery time in mind (i.e., during checkpointing).  In 
other words, we should only write pages that will actually advance the 
starting point for recovery.

> 
> Also note that the current checkpoint does 2 operations to ensure
> each page is on disk; you can not assume the page has hit disk
> until both are complete.  It first uses a Java write (which is async
> by default), and then it forces the entire file.  The second step
> is a big overhead on some systems, so it is not appropriate to do
> for each write (the overhead is CPU cost linear in the size of the
> file rather than in the number of dirty pages).

I think we SHOULD sync for every I/O, but not the way we do today.  By
opening the files with "rwd", we should be able to do this pretty
efficiently already today.  (At least on some systems; I am not sure
about non-POSIX systems like Windows.)  Syncing for every I/O gives us
much more control over the I/O, and we will not be vulnerable to queuing
effects that we do not control.
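
For example (a sketch only; the page-number and buffer names are made
up, and I am assuming one file per container):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Sketch only: with mode "rwd", each write() must reach the storage
    // device (file content, not metadata) before it returns, so no
    // separate sync of the whole file is needed per checkpointed page.
    final class SyncWriteSketch {
        static void writePage(String containerPath, long pageNumber,
                              byte[] pageBuffer, int pageSize) throws IOException {
            RandomAccessFile file = new RandomAccessFile(containerPath, "rwd");
            try {
                file.seek(pageNumber * pageSize);
                file.write(pageBuffer, 0, pageSize);  // durable when this returns
            } finally {
                file.close();
            }
        }
    }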

>  This has been discussed
> previously on the list.  As has been pointed out, the most efficient
> method of writing out a number of pages is to somehow queue a small
> number of writes async, and then wait for all to finish before
> queueing the next set.  Unfortunately standard OS mechanisms to do
> this don't exist yet in Java; they are being proposed in some new
> JSRs.  I have been waiting for patches from others, but if one doesn't
> come I will change the current checkpoint before the next release to
> queue a small number of writes, then wait for the estimated time of
> executing those writes, and then continue to queue more writes.  This
> should solve 90% of the checkpoint I/O flood issue.

I have been planning to address this for a while, but have not been able
to do so yet.  I was planning to experiment a bit with the syncing I
describe above to see if there are scenarios where such an approach
would not give sufficient throughput.  If that is the case, we would
need to parallelize the writing.  If I do not have time to do that, I
would go for something simpler, as you describe above.
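
For reference, the simpler batch-and-pause approach you describe could
look roughly like this; the batch size and the per-write cost estimate
are made-up tuning knobs, not existing code:

    import java.util.List;

    // Sketch only: write a small batch of pages, then wait roughly as
    // long as those writes are expected to take before queueing the next
    // batch, so the checkpoint does not flood the I/O system.
    abstract class BatchedWriterSketch<P> {
        abstract void queueWrite(P page);           // issue one asynchronous write

        void writeAll(List<P> dirtyPages, int batchSize, long estMillisPerWrite)
                throws InterruptedException {
            for (int i = 0; i < dirtyPages.size(); i += batchSize) {
                int end = Math.min(i + batchSize, dirtyPages.size());
                for (int j = i; j < end; j++) {
                    queueWrite(dirtyPages.get(j));
                }
                Thread.sleep((end - i) * estMillisPerWrite);  // estimated time for the batch
            }
        }
    }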

--
Øystein
