db-derby-dev mailing list archives

From "Raymond Raymond" <raymond_de...@hotmail.com>
Subject Re: Discussion of incremental checkpointing
Date Tue, 14 Feb 2006 12:46:16 GMT
Thanks a lot, Suresh, that's really a good idea.

Raymond


>From: Suresh Thalamati <suresh.thalamati@gmail.com>
>Reply-To: <derby-dev@db.apache.org>
>To: derby-dev@db.apache.org
>Subject: Re: Discussion of incremental checkpointing----Added some new 
>content
>Date: Fri, 10 Feb 2006 13:25:53 -0800
>
>
>My two cents:
>If the goal is to auto-tune the checkpoint interval, I think the amount
>of log generated is a good indication of how fast the system is. By
>finding out how fast the log is being generated, one can predict within
>reasonable error how long recovery will take on that system.
>
>It might be worth finding a simple solution to start with rather than
>trying to get it perfect. How about a simple scheme to tune the
>checkpoint interval like the following:
>
>Say on a particular system:
>
>1) We can generate N amount of log in X amount of time on a running
>system.
>
>2) R is the fraction of time it takes to recover the log generated in X
>amount of time. One can pre-calculate this factor with some tests
>instead of trying to find it on each boot.
>
>3) Y is the worst-case recovery time Derby users are likely to see by
>default.
>
>4) C is the amount of log after which a checkpoint should be scheduled.
>
>The ideal checkpoint-interval log size should then be something like:
>
>    C = (N / X) * (Y / R)
>
>
>For example:
>
>X = 5 min, N = 50 MB, Y = 1 min, R = 0.5
>
>C = (50 / 5) * (1 / 0.5) = 20 MB
>
>
>One could use a formula like the above at each checkpoint to tune when
>the next checkpoint should occur. I understand that the first estimate
>will be completely off, but the interval will stabilize after a few
>checkpoints.
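>
>A rough sketch of that tuning step in Java, just to illustrate the
>arithmetic (the class and method names here are made up, not Derby
>code):
>
>    /** Illustrates the tuning formula C = (N / X) * (Y / R). */
>    class CheckpointIntervalTuner {
>        private final double worstCaseRecoveryMin; // Y, e.g. 1 minute
>        private final double recoveryFactor;       // R, pre-calculated
>
>        CheckpointIntervalTuner(double y, double r) {
>            this.worstCaseRecoveryMin = y;
>            this.recoveryFactor = r;
>        }
>
>        /** logMb is N (MB of log since the last checkpoint), minutes
>         *  is X; returns C, the MB of log after which to schedule the
>         *  next checkpoint. */
>        double nextIntervalMb(double logMb, double minutes) {
>            double logRateMbPerMin = logMb / minutes;      // N / X
>            // Y / R converts the recovery budget back into "generation
>            // time" worth of log.
>            return logRateMbPerMin
>                * (worstCaseRecoveryMin / recoveryFactor);
>        }
>    }
>
>    // With the example numbers above (X = 5, N = 50, Y = 1, R = 0.5):
>    // new CheckpointIntervalTuner(1.0, 0.5).nextIntervalMb(50.0, 5.0)
>    // returns 20.0 MB.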
>
>Irrespective of whatever approach is finally implemented, we have to
>watch the overhead introduced in trying to get it exactly right; I don't
>think users will really care if recovery takes a few seconds more or
>less.
>
>Aside from tuning the checkpoint interval, I think we should find ways
>to minimize the effect of checkpoints on system throughput.
>I guess that is a different topic.
>
>
>
>Thanks
>-suresht
>
>
>
>Raymond Raymond wrote:
>>Mike,
>>I am sorry, I did not make it very clear what I am trying to do, so
>>I will explain it now. I think the last thing you mentioned is what
>>I want to solve:
>>
>>MM  how long it takes to do a checkpoint vs. recovery time
>>MM  requirements of the applications?
>>
>>I am working on automatic checkpointing, which lets the Derby engine
>>control the checkpointing process by itself depending on the runtime
>>situation. The goal is to strike a good balance between the runtime
>>resource consumption of the checkpointing process (especially disk I/O)
>>and the recovery time. I want to do as much checkpointing as possible
>>while interfering as little as possible with the real work of Derby.
>>(By system resources I mostly mean disk I/O, since for checkpointing,
>>disk I/O is the bottleneck among all the system resources.)
>>  Let's look at the current Derby checkpointing mechanism first. If we
>>set the checkpointing interval too short, Derby will checkpoint very
>>often, which takes a lot of disk I/O, and responses to client requests
>>will be delayed. Conversely, if we set the checkpointing interval too
>>long, Derby will keep a lot of data in cache, and when a crash happens,
>>the recovery time will be very long. I am trying to strike a good
>>balance between the two.
>>  Then, let me show the benefits of my proposal. My basic idea is to
>>checkpoint as much as possible when disk I/O is not busy. I think the
>>incremental checkpointing mechanism, combined with consideration of the
>>runtime situation (what we discussed last time in "Discussion of how to
>>map the recovery time into Xmb of log": statistics on system
>>performance, the time of a data read or a log write, etc.), can solve
>>the problem.
>>  We can imagine that the incremental checkpointing mechanism divides
>>the current checkpoint process into several pieces. We let the user set
>>an acceptable recovery time. Then, when the system is not busy, we do a
>>piece of the checkpoint; when the system becomes busy, we suspend the
>>checkpoint process for a while, and so on. But if the log grows longer
>>than the acceptable length, we do checkpoint work even when the system
>>is busy, to meet the acceptable recovery time set by the user. We make
>>each piece of the checkpoint an intact checkpoint by updating the log
>>control file, as in the sketch below.
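>>
>>In rough Java, the loop I have in mind would look something like the
>>following. All of these interfaces and names are invented just for the
>>sketch; they are not real Derby code:
>>
>>    import java.io.IOException;
>>
>>    interface DirtyPageList {
>>        boolean isEmpty();
>>        /** Log instant at which the oldest remaining page was first
>>         *  dirtied; recovery redo can start here. */
>>        long oldestFirstDirtyInstant();
>>        /** Write the earliest-dirtied page and drop it from the list. */
>>        void writeOldest() throws IOException;
>>    }
>>
>>    interface LogControl {
>>        long logLength();      // log since the last complete checkpoint
>>        long currentLogEnd();  // instant of the current end of log
>>        void advanceRedoMark(long instant) throws IOException;
>>    }
>>
>>    interface SystemMonitor {
>>        boolean diskIsBusy();
>>    }
>>
>>    class IncrementalCheckpointer {
>>        void run(DirtyPageList dirtyPages, LogControl logControl,
>>                 SystemMonitor monitor, long acceptableLogLength,
>>                 int pagesPerPiece)
>>                throws IOException, InterruptedException {
>>            int written = 0;
>>            while (!dirtyPages.isEmpty()) {
>>                // Suspend while disk I/O is busy, unless the log has
>>                // outgrown the user's acceptable recovery time; then
>>                // we keep writing even though the system is busy.
>>                if (monitor.diskIsBusy()
>>                        && logControl.logLength() <= acceptableLogLength) {
>>                    Thread.sleep(1000);
>>                    continue;
>>                }
>>                dirtyPages.writeOldest();   // earliest-dirtied first
>>                if (++written % pagesPerPiece == 0
>>                        && !dirtyPages.isEmpty()) {
>>                    // End of one "piece": updating the log control file
>>                    // makes the work done so far an intact checkpoint.
>>                    logControl.advanceRedoMark(
>>                            dirtyPages.oldestFirstDirtyInstant());
>>                }
>>            }
>>            logControl.advanceRedoMark(logControl.currentLogEnd());
>>        }
>>    }
>>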
>>I think what you described in your comments is what the incremental
>>checkpointing will do:
>>
>>MM  I think what you want is at step 2 to somehow write multiple
>>MM  checkpoint log records rather than wait to the end.  Let's assume
>>MM  the previous REDO mark was LOGEND_PREV.  So while writing the pages
>>MM  you would write a new type of checkpoint record that would move the
>>MM  REDO mark up to somewhere between LOGEND_PREV and the current end
>>MM  of the log file.  I think I agree that if you wrote a checkpoint
>>MM  record for every I/O from your ordered list then you would
>>MM  guarantee minimum redo recovery time, where the current system
>>MM  writes one log record at the end of all I/Os, which at the end of
>>MM  writing all pages would match your approach (a little hard to
>>MM  compare as your approach is continuous, but if you just compare the
>>MM  dirty page lists at LOGEND I think this is true).
>>
>>Establishing a dirty page list is just my suggestion. Its purpose is to
>>sort the dirty pages in ascending order of the time when they were
>>first updated. If we have any other way to do that without using extra
>>memory, we don't have to establish such a list.
>>
>>Does anyone have any suggestions on this? Everyone is welcome to give
>>an opinion.
>>
>>
>>
>>Raymond
>>
>>
>>>From: Mike Matrigali <mikem_app@sbcglobal.net>
>>>
>>>I am hoping someone else will get into this discussion; I am not
>>>seeing the benefit vs. runtime cost of the incremental checkpoint
>>>approach. It could be that I am blinded by the systems I have worked
>>>on. I just have a gut reaction against adding another queue to the
>>>I/O system; going forward I would rather see Derby more parallel, and
>>>a single queue seems to be the wrong direction.
>>>
>>>Again, what problem are you trying to solve? My guesses:
>>>1) avoid checkpoint I/O saturation (I think this needs to be done, but
>>>   it can be done in the current checkpoint scheme)
>>>2) reduce the number of redo recovery log records (how important is
>>>   this, at the cost of runtime list manipulation?)
>>>3) some other problem?
>>>
>>>I think you are trying to solve 2, which I admit I don't see as much
>>>of a problem.  Currently, at a high level (ignoring details), we do:
>>>1) start the checkpoint, note the current end of the log file (LOGEND)
>>>2) slowly write all dirty pages in cache (I agree we need a fix in
>>>    this area to the current scheme)
>>>3) when done, write a checkpoint log record indicating the REDO mark
>>>    at LOGEND; by now the log may be at LOGEND + N
>>>
>>>I think what you want is at step 2 to somehow write multiple
>>>checkpoint log records rather than wait to the end.  Let's assume the
>>>previous REDO mark was LOGEND_PREV.  So while writing the pages you
>>>would write a new type of checkpoint record that would move the REDO
>>>mark up to somewhere between LOGEND_PREV and the current end of the
>>>log file.  I think I agree that if you wrote a checkpoint record for
>>>every I/O from your ordered list then you would guarantee minimum redo
>>>recovery time, where the current system writes one log record at the
>>>end of all I/Os, which at the end of writing all pages would match
>>>your approach (a little hard to compare as your approach is
>>>continuous, but if you just compare the dirty page lists at LOGEND I
>>>think this is true).
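>>>
>>>In other words, the REDO mark at any moment can move up to the
>>>first-dirtied log instant of the oldest page that is still dirty.  A
>>>tiny sketch of that bookkeeping (invented names, not Derby code):
>>>
>>>    import java.util.TreeSet;
>>>
>>>    class RedoMark {
>>>        // First-dirtied log instants of all pages still dirty.
>>>        private final TreeSet<Long> firstDirty = new TreeSet<>();
>>>
>>>        void pageDirtied(long instant) { firstDirty.add(instant); }
>>>        void pageWritten(long instant) { firstDirty.remove(instant); }
>>>
>>>        /** Redo may start at the oldest still-dirty instant, or at
>>>         *  the end of the log when nothing is dirty. */
>>>        long redoMark(long logEnd) {
>>>            return firstDirty.isEmpty() ? logEnd : firstDirty.first();
>>>        }
>>>    }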
>>>
>>>So again, let's first talk about what you want to solve rather than
>>>how to solve it.  Maybe you have some assumptions about what type of
>>>system you are trying to address, like:  size of cache, percentage of
>>>dirty pages, how long it takes to do a checkpoint vs. recovery time
>>>requirements of the applications?
>>>
>>>Raymond Raymond wrote:
>>>
>>> >
>>> >> From: Mike Matrigali <mikem_app@sbcglobal.net>
>>> >>
>>> >> I think this is the right path, though it would need more details:
>>> >> o does boot mean the first-time boot of each db?
>>> >> o how do we determine "this machine"?
>>> >> o and what is the total time to run such a test?
>>> >>
>>> >> There are some very quick and useful tests that would be fine to
>>> >> add to the default system and run one time per database.  Measuring
>>> >> how long it takes to do a commit and how long it takes to do a
>>> >> single database read from disk would be fine.  It seems like just
>>> >> these two numbers could be used to come up with a very good default
>>> >> estimate of log recovery time per log record.  Then, as you
>>> >> propose, the actual estimate can be improved by measuring real
>>> >> recovery time in the future.
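>>> >>
>>> >> For instance, the default estimate could have the following shape
>>> >> (my invented names, and the amortization factor is purely an
>>> >> assumption, nothing measured):
>>> >>
>>> >>     class RecoveryEstimate {
>>> >>         // Assumed: one commit-style log sync per 100 records.
>>> >>         static final double RECORDS_PER_SYNC = 100.0;
>>> >>
>>> >>         /** Redo of one record costs roughly one random page read
>>> >>          *  plus an amortized share of a log sync. */
>>> >>         static double millis(long logRecords, double readMillis,
>>> >>                              double commitMillis) {
>>> >>             return logRecords
>>> >>                 * (readMillis + commitMillis / RECORDS_PER_SYNC);
>>> >>         }
>>> >>     }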
>>> >>
>>> >> I am not convinced of the need for the bigger test, but if the
>>> >> default is not to run it automatically and it is your "itch" to
>>> >> have such a configuration option, then I would not oppose it.  I do
>>> >> see great value in coming up with a very good default estimate of
>>> >> recovery time based on the outstanding number of log records.  And
>>> >> I even envision a framework in the future where Derby would
>>> >> schedule other non-essential background tasks that have been
>>> >> discussed in the
>>> >> On a different track, I am still unclear on the checkpoint dirty
>>> >> page LRU list.  Rather than talk about implementation details, I
>>> >> would like to understand the problem you are trying to solve.  For
>>> >> instance, I well understand the goal of configuring checkpoints
>>> >> such that they map to a user-understandable tradeoff between
>>> >> current runtime performance and how long I am willing to wait the
>>> >> next time I boot the database after a crash.
>>> >>
>>> >> What is the other problem you are looking at?
>>> >>
>>> >
>>> >
>>> > Mike:
>>> >
>>> > What I am looking at next is to redesign the checkpointing process.
>>> > The current checkpointing mechanism writes out all the dirty pages
>>> > during the checkpoint. That causes a burst of disk I/O. Several
>>> > problems with this have been reported, such as DERBY-799 reported by
>>> > Oystein. I have a proposal for incremental checkpointing. I have
>>> > mentioned it before; I would like to explain it in more detail.
>>> >
>>> > We should find some way to sort the dirty pages in ascending order
>>> > of the time when they were first updated. The incremental
>>> > checkpointing process will continually write out the dirty pages,
>>> > from the earliest updated to the latest updated. The writing rate
>>> > depends on the system situation.
>>> > There are two situations in which we update the log control file:
>>> > 1) When data reads or log writes start to have a longer response
>>> > time than an acceptable value, we update the log control file and
>>> > sleep for a while.
>>> > 2) After writing out a certain number of dirty pages.
>>> >
>>> > The benefits are:
>>> > 1) Since we write out dirty pages from the earliest updated to the
>>> > latest updated, the checkpoint instance keeps advancing. Since the
>>> > incremental checkpoint is performed continuously, the checkpoint
>>> > instance will be much closer to the tail of the log than with
>>> > conventional checkpointing.
>>> > 2) The checkpointing process can be paused if disk I/O becomes
>>> > really busy, and the finished part is still an intact checkpoint
>>> > instance.
>>> >
>>> > Do you remember I suggested establishing a dirty page list in which
>>> > dirty pages are sorted in ascending order of the time when they
>>> > were first updated? I would like to discuss it again.
>>> >
>>> > Actually, the list is not designed to speed up the checkpoint
>>> > process. It is for the incremental checkpointing described above.
>>> > To keep the checkpoint instance advancing, we must guarantee that
>>> > the earlier updated pages have been written out. That's why I
>>> > suggested establishing such a list.
>>> >
>>> > In the last discussion, you also mentioned a problem:
>>> >
>>> > MM  The downside with the current algorithm is that a page that is
>>> > MM  made dirty after the checkpoint starts will be written, and if
>>> > MM  it gets written again before the next checkpoint we have done
>>> > MM  1 too many I/O's.  I think I/O optimization may benefit more by
>>> > MM  working on optimizing the background I/O thread than working on
>>> > MM  the checkpoint.
>>> >
>>> >
>>> > If the background I/O thread can refer to this list, I think it can
>>> > help solve the problem you mentioned. I am not very familiar with
>>> > the background I/O thread, so if I am wrong, please point it out.
>>> >
>>> > In the list, the dirty pages are sorted in ascending order of the
>>> > time when they were first updated, which means the oldest dirty
>>> > page is at the head of the list and the most recently dirtied page
>>> > is at the end. The operations on the list are (see the sketch after
>>> > this list):
>>> > - When a page is updated and it is not in the list, we append it to
>>> > the end of the list.
>>> > - When a dirty page in the list is written out to disk, it is
>>> > removed from the list.
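>>> >
>>> > A minimal sketch of such a list in Java (illustrative only; the
>>> > Page type and the locking are simplified). A LinkedHashMap in
>>> > insertion order gives exactly the first-updated order:
>>> >
>>> >     import java.util.LinkedHashMap;
>>> >     import java.util.Map;
>>> >
>>> >     class FirstDirtiedList {
>>> >         interface Page { /* stand-in for a real cache page */ }
>>> >
>>> >         // Insertion order == order pages were FIRST dirtied.
>>> >         private final Map<Long, Page> pages = new LinkedHashMap<>();
>>> >
>>> >         /** Called on every update; only the first update inserts,
>>> >          *  so later updates do not disturb the page's position. */
>>> >         synchronized void pageDirtied(long pageId, Page page) {
>>> >             pages.putIfAbsent(pageId, page);
>>> >         }
>>> >
>>> >         /** Called when the checkpoint or the background I/O thread
>>> >          *  has written the page out to disk. */
>>> >         synchronized void pageWritten(long pageId) {
>>> >             pages.remove(pageId);
>>> >         }
>>> >
>>> >         /** Oldest dirty page: the next one the checkpoint writes,
>>> >          *  or null when nothing is left. */
>>> >         synchronized Page oldest() {
>>> >             for (Page p : pages.values()) return p;
>>> >             return null;
>>> >         }
>>> >     }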
>>> >
>>> > Let's look at your problem:
>>> > if a page is made dirty after the checkpoint starts, then
>>> >
>>> > 1) If the page was already dirty before this update, it should be
>>> >  in the list already, so we don't need to add it again. When the
>>> >  checkpoint process writes this dirty page out to disk, it is
>>> >  removed from the list, and if the background I/O thread refers to
>>> >  the list, it will know there is no need to write this page again.
>>> > 2) If the page was updated for the first time, it is appended to
>>> >  the end of the list. If the background I/O thread refers to the
>>> >  list, it knows there is no need to write this page out so soon,
>>> >  since it has just been updated.
>>> >
>>> >
>>> > Is it reasonable?
>>> >
>>> >
>>> > Raymond
>>> >
>>
>>
>>
>>
>


