db-derby-dev mailing list archives

From Suresh Thalamati <suresh.thalam...@gmail.com>
Subject Re: [jira] Commented: (DERBY-733) Starvation in RAFContainer.readPage()
Date Fri, 16 Dec 2005 19:28:41 GMT
This might be obvious, but I thought I would mention it anyway. My 
understanding is that one cannot just enable "rwd" (direct io) for the 
checkpoint. It has to be enabled for all the writes from the page 
cache; otherwise a file sync is required before doing "rwd" writes, 
because if a file is opened in "rw" mode and then in "rws" mode, I am 
not sure whether the writes done through the first open will also get 
synced to the disk, and when the file is opened in "rwd" mode, I doubt 
that they will.
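
To make the mode interplay concrete, here is a minimal sketch of what I 
mean (the class and file name are made up for illustration, this is not 
Derby code):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative only: shows the interplay between "rw" and "rwd" opens.
public class WriteModeSketch {
    public static void main(String[] args) throws IOException {
        File container = new File("c10.dat");   // hypothetical container file

        // Plain "rw" writes may sit in the OS cache; nothing is guaranteed
        // to be on disk until an explicit sync.
        RandomAccessFile rw = new RandomAccessFile(container, "rw");
        rw.write(new byte[4096]);

        // Before relying on write-through ("rwd") semantics, the earlier
        // buffered writes still need an explicit sync:
        rw.getFD().sync();
        rw.close();

        // "rwd" writes the file content synchronously on every write(),
        // so no separate sync is needed for writes done through this handle.
        RandomAccessFile rwd = new RandomAccessFile(container, "rwd");
        rwd.seek(4096);
        rwd.write(new byte[4096]);
        rwd.close();
    }
}

As far as I can tell, the "rwd" guarantee only covers writes issued 
through that handle, which is why the explicit sync above seems necessary.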

If files are always opened in direct io mode, then page cache cleaning 
can possibly get slow, and a user query's request for a new page in the 
buffer pool can also become slow if the cache is full and a page has to 
be thrown out to get a free one.  Another thing to note is that buffer 
cleaning is done on the Rawstore daemon thread, which is also overloaded 
with some post commit work, so the page cache may not get cleaned often 
in some cases.


Thanks
-suresht


Mike Matrigali wrote:
> excellent, I look forward to your work on concurrent I/O.  I am likely
> to not be on the list much for the next 2 weeks, so won't be able to
> help much.  In thinking about this issue I was hoping that somehow
> the current container cache could be enhanced to support more than
> one open container per container.  Then one would automatically get
> control over the open file resource across all containers, by setting
> the currently supported "max" on the container pool.
> 
> The challenge is that this would be a new concept for the basic services
> cache implementation.  What we want is a cache that supports multiple
> objects with the same key, and that returns an available one if another
> one is "busy".  Also returns a newly opened one, if all are busy.  I
> am going to start a thread on this, to see if any other help is
> available.  If possible I like this approach better than having a queue 
> of open files per container where it hard to control the growth of one 
> queue vs. the growth in another.
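
Something along these lines is how I picture the "multiple objects per 
key" idea; just an illustrative sketch, not the basic services cache API 
(class and method names are made up):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: a pool keyed by container file that hands out
// an idle open file when one is available and opens a new one when all are
// busy.  A real version would also honor a global "max" across all keys,
// the way the container pool's max works today.
public class OpenFilePool {
    private final Map<File, ArrayDeque<RandomAccessFile>> idle = new HashMap<>();

    public synchronized RandomAccessFile checkOut(File container)
            throws IOException {
        ArrayDeque<RandomAccessFile> q = idle.get(container);
        if (q != null && !q.isEmpty()) {
            return q.pop();                              // reuse an available open file
        }
        return new RandomAccessFile(container, "rw");    // all busy: open another
    }

    public synchronized void checkIn(File container, RandomAccessFile raf) {
        idle.computeIfAbsent(container, k -> new ArrayDeque<>()).push(raf);
    }
}

A real version would also need to close idle files (from this or other 
keys) when the global max is reached; I left that out to keep the sketch 
short.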
> 
> On the checkpoint issue, I would not have a problem with changes to the
> current mechanism to do "rwd" type sync I/O rather than sync at end (but 
> we will have to support both until we don't have to support older 
> versions of JVM's).  I believe this is as close
> to "direct i/o" as we can get from java - if you mean something 
> different here let me know.  The benefit is that I believe it will fix
> the checkpoint flooding the I/O system problem.  The downside is that
> it will cause the total number of I/O's to increase in cases where the
> derby block size is smaller than the filesystem/disk blocksize -- 
> assuming the OS currently converts our flood of multiple async writes to 
> the same file to a smaller number of bigger I/O's.  I think this trade 
> off is fine for checkpoints.  If checkpoint efficiency is an issue, 
> there are a number of other ways to address it in the future.
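
For what it's worth, the two checkpoint write paths being compared might 
look roughly like this (the Page type and method names are hypothetical, 
not Derby's actual classes):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;

// Sketch of the two checkpoint write paths; Page and the method names are
// hypothetical, not Derby's actual classes.
public class CheckpointWriteSketch {
    static class Page {
        long offset;
        byte[] data;
    }

    // Write-through path: every write is synced as it goes ("rwd").
    static void checkpointWriteThrough(File container, List<Page> dirty)
            throws IOException {
        RandomAccessFile raf = new RandomAccessFile(container, "rwd");
        try {
            for (Page p : dirty) {
                raf.seek(p.offset);
                raf.write(p.data);       // written synchronously, one page at a time
            }
        } finally {
            raf.close();
        }
    }

    // Older-JVM path: buffered writes, then one sync at the end.
    static void checkpointSyncAtEnd(File container, List<Page> dirty)
            throws IOException {
        RandomAccessFile raf = new RandomAccessFile(container, "rw");
        try {
            for (Page p : dirty) {
                raf.seek(p.offset);
                raf.write(p.data);       // may just land in the OS cache
            }
            raf.getFD().sync();          // the flood of I/O can happen here
        } finally {
            raf.close();
        }
    }
}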
> 
> Øystein Grøvlen wrote:
> 
>>>>>>> "MM" == Mike Matrigali <mikem_app@sbcglobal.net> writes:
>>
>>
>>
>>     MM> user thread initiated read
>>     MM>      o should be high priority and should be "fair" with
>>     MM>        other user initiated reads.
>>
>>     MM>      o These happen anytime a read of a row causes a cache miss.
>>     MM>      o Currently only one I/O operation to a file can happen
>>     MM>        at a time, which could be a big problem for some types
>>     MM>        of multi-threaded, highly concurrent, low number of
>>     MM>        table apps.  I think
>>     MM>        the path here should be to increase the number of
>>     MM>        concurrent I/O's allowed to be outstanding by allowing
>>     MM>        each thread to have 1 (assuming sufficient open file
>>     MM>        resources).  100 outstanding I/O's to a single file may
>>     MM>        be overkill, but in java we can't know that the file is
>>     MM>        not actually 100 disks underneath.  The number of I/O's
>>     MM>        should grow as the actual application load increases;
>>     MM>        note that I still think max I/O's should be tied to the number
>>     MM>        of user threads, plus maybe a small number for
>>     MM>        background processing.
>>
>> There was an interesting paper at the last VLDB conference that
>> discussed the virtue of having many outstanding I/O requests:
>>     http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf (paper)
>>     http://www.vldb2005.org/program/slides/wed/s1116-hall.pdf (slides)
>>
>> The basic message is that many outstanding requests are good.  The
>> SCSI controller they used in their study was able to handle 32
>> concurrent requests.  One reason database systems have been
>> conservative with respect to outstanding requests is that they want to
>> control the priority of the I/O requests.  We would like user thread
>> initiated requests to have priority over checkpoint initiated writes.
>> (The authors suggest building priorities into the file system to solve
>> this.)
>>
>> I plan to start working on a patch for allowing more concurrency
>> between readers within a few weeks.  The main challenge is to find the
>> best way to organize the open file descriptors (reuse, limiting the
>> max number, etc.).  I will file a JIRA for this.
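
One thing that might help with the descriptor question: NIO's positional 
read does not use the shared file pointer, so several readers could in 
principle share one channel.  A minimal sketch (file name and page number 
are just illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: two reader threads issue positional reads against one FileChannel.
// read(buffer, position) does not use the shared file pointer, so the readers
// do not have to serialize on a single seek-then-read lock (whether they
// actually run in parallel is up to the underlying implementation).
public class ConcurrentReadSketch {
    public static void main(String[] args) throws Exception {
        final int pageSize = 4096;
        final FileChannel ch =
                new RandomAccessFile("c10.dat", "r").getChannel();  // hypothetical file

        Runnable reader = new Runnable() {
            public void run() {
                try {
                    ByteBuffer page = ByteBuffer.allocate(pageSize);
                    ch.read(page, 3L * pageSize);   // read page 3, no seek needed
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        };

        Thread t1 = new Thread(reader);
        Thread t2 = new Thread(reader);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        ch.close();
    }
}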
>>
>> I also think we should consider mechanisms for read ahead.
>>
>>     MM> user thread initiated write
>>     MM>       o same issues as user initiated read.
>>     MM>       o happens way less than read, as it should only happen
>>     MM>         on a cache miss that can't find a non-dirty page in
>>     MM>         the cache.  The background cache cleaner should be
>>     MM>         keeping this from happening, though apps that only do
>>     MM>         updates and cause cache hits are the worst case.
>>
>>
>>     MM> checkpoint initiated write:
>>     MM>       o sometimes too many checkpoints happen in too short a
>>     MM>         time.
>>     MM>       o needs an improved scheduling algorithm; currently it
>>     MM>         just defaults to N bytes written to the log file no
>>     MM>         matter what the speed of log writes is.
>>     MM>       o currently may flood the I/O system, causing user
>>     MM>         reads/writes to stall - on some OS/JVM's this stall is
>>     MM>         amazing, like tens of seconds.
>>     MM>       o It is not important that checkpoints run fast; it is
>>     MM>         more important that they proceed methodically to
>>     MM>         conclusion while causing little interruption to "real"
>>     MM>         work by user threads.
>>     MM>         Various approaches to this were discussed, but no
>>     MM>         patches yet.
>>
>> For the scheduling of checkpoints, I was hoping Raymond would come up
>> with something.  Raymond, are you still with us?
>>
>> I have discussed our I/O architecture with Solaris engineers, and I
>> was told that our approach of doing buffered writes followed by an
>> fsync is the worst approach on Solaris.  They recommended using direct
>> I/O.  I guess there will be situations where single-threaded direct
>> I/O for
>> checkpointing will give too low throughput.  In that case, we could
>> consider a pool of writers.  The challenge would then be how to give
>> priority to user-initiated requests over multi-threaded checkpoint
>> writes as discussed above.
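
If the pool-of-writers idea is pursued, the priority part might be as 
simple as a shared priority queue in front of the pool; a rough sketch, 
with all names hypothetical and nothing here being an existing Derby 
component:

import java.util.concurrent.PriorityBlockingQueue;

// Rough sketch of a "pool of writers" where user-initiated requests jump
// ahead of checkpoint writes in a shared queue.
public class PrioritizedWriterPool {
    enum Source { USER, CHECKPOINT }   // USER sorts first (lower ordinal)

    static class WriteRequest implements Comparable<WriteRequest> {
        final Source source;
        final Runnable work;           // the actual page write

        WriteRequest(Source source, Runnable work) {
            this.source = source;
            this.work = work;
        }

        public int compareTo(WriteRequest other) {
            return source.ordinal() - other.source.ordinal();
        }
    }

    private final PriorityBlockingQueue<WriteRequest> queue =
            new PriorityBlockingQueue<WriteRequest>();

    public PrioritizedWriterPool(int writers) {
        for (int i = 0; i < writers; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            queue.take().work.run();   // highest priority first
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }
    }

    public void submit(Source source, Runnable work) {
        queue.offer(new WriteRequest(source, work));
    }
}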
>>
> 
> 

