db-derby-dev mailing list archives

From Mike Matrigali <mikem_...@sbcglobal.net>
Subject Re: [jira] Commented: (DERBY-733) Starvation in RAFContainer.readPage()
Date Fri, 16 Dec 2005 21:00:24 GMT
You are right; I'll have to think about this some more.  Until Java
gets async, guaranteed synced-to-disk writes, I think we should continue
to use the current method for user-initiated writes.
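For reference, the "rwd" open mode discussed in this thread is exposed through java.io.RandomAccessFile. A minimal sketch (file name and page size are illustrative, not Derby's):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RwdWriteDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("derby-page", ".dat");
        f.deleteOnExit();

        // "rwd" asks the JVM to flush file content (not metadata) to the
        // device on every write, so no separate sync() call is needed;
        // "rws" also flushes metadata, "rw" flushes nothing until fsync.
        try (RandomAccessFile raf = new RandomAccessFile(f, "rwd")) {
            byte[] page = new byte[4096];   // one illustrative page
            raf.seek(0);
            raf.write(page);                // durable when write returns
        }
        System.out.println(f.length());     // prints 4096
    }
}
```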

Suresh Thalamati wrote:
> This might be obvious, but I thought I would mention it anyway. My 
> understanding is that one cannot just enable "rwd" (direct I/O) for the 
> checkpoint; it has to be enabled for all the writes from the page cache. 
> Otherwise a file sync is required before doing "rwd" writes, because I am 
> not sure that, if a file is opened in "rw" mode and then reopened in 
> "rws" mode, writes through the first open will also get synced to the 
> disk; and when the file is opened in "rwd" mode, I doubt that they will.
> If files are always opened in direct I/O mode, then page cache cleaning 
> can possibly get slow, and a user query's request for a new page in the 
> buffer pool can also become slow if the cache is full and a page has to 
> be thrown out to get a free page.  Another thing to note is that buffer 
> cleaning is done on the RawStore daemon thread, which is also overloaded 
> with some post-commit work, so the page cache may not get cleaned often 
> in some cases.
> Thanks
> -suresht
> Mike Matrigali wrote:
>> excellent, I look forward to your work on concurrent I/O.  I am likely
>> to not be on the list much for the next 2 weeks, so won't be able to
>> help much.  In thinking about this issue I was hoping that somehow
>> the current container cache could be enhanced to support more than
>> one open container per container.  Then one would automatically get
>> control over the open file resource across all containers, by setting
>> the currently supported "max" on the container pool.
>> The challenge is that this would be a new concept for the basic services
>> cache implementation.  What we want is a cache that supports multiple
>> objects with the same key, and that returns an available one if another
>> one is "busy".  Also returns a newly opened one, if all are busy.  I
>> am going to start a thread on this, to see if any other help is
>> available.  If possible, I like this approach better than having a 
>> queue of open files per container, where it is hard to control the 
>> growth of one queue vs. the growth of another.
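A minimal sketch of the kind of cache described above: several handles per key, hand out an idle one, and open a new one only when all are busy. The class and names here are hypothetical illustrations, not the basic services cache API:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Function;

public class MultiHandleCacheDemo {

    // Cache allowing multiple values per key: reuse an idle handle if one
    // exists, otherwise open another (a real version would cap totalOpen
    // with the currently supported "max" on the container pool).
    static class MultiHandleCache<K, V> {
        private final Map<K, Queue<V>> idle = new HashMap<>();
        private final Function<K, V> opener;   // stand-in for "open container"
        private int totalOpen = 0;

        MultiHandleCache(Function<K, V> opener) { this.opener = opener; }

        synchronized V acquire(K key) {
            Queue<V> q = idle.get(key);
            if (q != null && !q.isEmpty()) {
                return q.poll();               // reuse an idle handle
            }
            totalOpen++;                       // all busy: open another
            return opener.apply(key);
        }

        synchronized void release(K key, V handle) {
            idle.computeIfAbsent(key, k -> new ArrayDeque<>()).add(handle);
        }

        synchronized int openCount() { return totalOpen; }
    }

    public static void main(String[] args) {
        MultiHandleCache<String, Integer> c =
                new MultiHandleCache<>(k -> k.length());
        Integer a = c.acquire("seg1");   // opens the first handle
        Integer b = c.acquire("seg1");   // first is busy: opens a second
        c.release("seg1", a);
        Integer d = c.acquire("seg1");   // reuses the released handle
        System.out.println(c.openCount()); // prints 2
    }
}
```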
>> On the checkpoint issue, I would not have a problem with changes to the
>> current mechanism to do "rwd"-type sync I/O rather than sync-at-end 
>> (but we will have to support both until we no longer have to support 
>> older JVMs).  I believe this is as close
>> to "direct i/o" as we can get from java - if you mean something 
>> different here let me know.  The benefit is that I believe it will fix
>> the checkpoint flooding the I/O system problem.  The downside is that
>> it will cause total number of I/O's to increase in cases where the
>> derby block size is smaller than the filesystem/disk blocksize -- 
>> assuming the OS currently converts our flood of multiple async writes 
>> to the same file to a smaller number of bigger I/O's.  I think this 
>> trade off is fine for checkpoints.  If checkpoint efficiency is an 
>> issue, there are a number of other ways to address it in the future.
>> Øystein Grøvlen wrote:
>>>>>>>> "MM" == Mike Matrigali <mikem_app@sbcglobal.net> writes:
>>>     MM> user thread initiated read
>>>     MM>      o Should be high priority and should be "fair" with
>>>     MM>        other user-initiated reads.
>>>     MM>      o These happen any time a read of a row causes a cache miss.
>>>     MM>      o Currently only one I/O operation to a file can happen
>>>     MM>        at a time, which could be a big problem for some types
>>>     MM>        of multi-threaded, highly concurrent apps that use a
>>>     MM>        low number of tables.  I think the path here should be
>>>     MM>        to increase the number of concurrent I/O's allowed to
>>>     MM>        be outstanding by allowing each thread to have 1
>>>     MM>        (assuming sufficient open file resources).  100
>>>     MM>        outstanding I/O's to a single file may be overkill,
>>>     MM>        but in java we can't know that the file is not
>>>     MM>        actually 100 disks underneath.  The number of I/O's
>>>     MM>        should grow as the actual application load increases;
>>>     MM>        note I still think max I/O's should be tied to the
>>>     MM>        number of user threads, plus maybe a small number for
>>>     MM>        background processing.
>>> There was an interesting paper at the last VLDB conference that
>>> discussed the virtue of having many outstanding I/O requests:
>>>     http://www.vldb2005.org/program/paper/wed/p1116-hall.pdf (paper)
>>>     http://www.vldb2005.org/program/slides/wed/s1116-hall.pdf (slides)
>>> The basic message is that many outstanding requests are good.  The
>>> SCSI controller they used in their study was able to handle 32
>>> concurrent requests.  One reason database systems have been
>>> conservative with respect to outstanding requests is that they want to
>>> control the priority of the I/O requests.  We would like user thread
>>> initiated requests to have priority over checkpoint initiated writes.
>>> (The authors suggest building priorities into the file system to solve
>>> this.)
>>> I plan to start working on a patch for allowing more concurrency
>>> between readers within a few weeks.  The main challenge is to find the
>>> best way to organize the open file descriptors (reuse, limit the max.
>>> number etc.)  I will file a JIRA for this.
>>> I also think we should consider mechanisms for read ahead.
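One way to sketch the "more concurrency between readers" idea: with java.nio, positional FileChannel reads take an explicit offset and do not move a shared file pointer, so reader threads need not serialize on one monitor. This is an illustration under an assumed page size, not a patch proposal:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ConcurrentPageReads {
    static final int PAGE_SIZE = 4096;   // illustrative, not Derby's default

    // FileChannel.read(dst, position) does not use the channel's file
    // pointer, so several threads can issue reads on the same channel
    // concurrently instead of queueing behind a single seek+read.
    static byte[] readPage(FileChannel ch, long pageNo) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(PAGE_SIZE);
        long pos = pageNo * PAGE_SIZE;
        while (buf.hasRemaining()) {
            if (ch.read(buf, pos + buf.position()) < 0) break;  // EOF
        }
        return buf.array();
    }

    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("container", ".dat");
        Files.write(f, new byte[PAGE_SIZE * 4]);   // four empty pages
        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
            Thread[] readers = new Thread[4];
            for (int i = 0; i < readers.length; i++) {
                final long page = i;
                readers[i] = new Thread(() -> {
                    try {
                        readPage(ch, page);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                readers[i].start();
            }
            for (Thread t : readers) t.join();
            System.out.println("read 4 pages concurrently");
        } finally {
            Files.delete(f);
        }
    }
}
```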
>>>     MM> user thread initiated write
>>>     MM>       o Same issues as user-initiated read.
>>>     MM>       o Happens way less than read, as it should only happen
>>>     MM>         on a cache miss that can't find a non-dirty page in
>>>     MM>         the cache.  The background cache cleaner should be
>>>     MM>         keeping this from happening, though apps that only do
>>>     MM>         updates and cause cache hits are the worst case.
>>>     MM> checkpoint initiated write:
>>>     MM>       o Sometimes too many checkpoints happen in too short a
>>>     MM>         time.
>>>     MM>       o Needs an improved scheduling algorithm; currently it
>>>     MM>         just defaults to N number of bytes written to the log
>>>     MM>         file, no matter what the speed of log writes is.
>>>     MM>       o Currently it may flood the I/O system, causing user
>>>     MM>         reads/writes to stall -- on some OS/JVM combinations
>>>     MM>         this stall is amazing, like tens of seconds.
>>>     MM>       o It is not important that checkpoints run fast; it is
>>>     MM>         more important that they proceed methodically to
>>>     MM>         conclusion while causing as little interruption as
>>>     MM>         possible to "real" work by user threads.
>>>     MM>         Various approaches to this were discussed, but no
>>>     MM>         patches yet.
>>> For the scheduling of checkpoints, I was hoping Raymond would come up
>>> with something.  Raymond are you still with us?
>>> I have discussed our I/O architecture with Solaris engineers, and I
>>> was told that our approach of doing buffered writes followed by an
>>> fsync is the worst approach on Solaris.  They recommended using direct
>>> I/O.  I guess there will be situations where single-threaded direct
>>> I/O for checkpointing will give too low throughput.  In that case, we could
>>> consider a pool of writers.  The challenge would then be how to give
>>> priority to user-initiated requests over multi-threaded checkpoint
>>> writes as discussed above.
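A minimal sketch of a checkpoint that proceeds methodically rather than flooding the I/O system: write dirty pages in small bursts with a pause between bursts so user-initiated I/O can get through. The batch size, pause, and PageWriter interface are invented tuning knobs, not Derby's API:

```java
public class PacedCheckpoint {
    static final int BATCH = 8;        // pages written per burst (illustrative)
    static final long PAUSE_MS = 10;   // breather for user-initiated I/O

    // Stand-in for whatever actually writes a dirty page from the cache.
    interface PageWriter { void write(int pageNo); }

    // Write all dirty pages, sleeping after each burst so the checkpoint
    // trickles to conclusion instead of saturating the disk queue.
    static void checkpoint(int dirtyPages, PageWriter w)
            throws InterruptedException {
        for (int p = 0; p < dirtyPages; p++) {
            w.write(p);
            if ((p + 1) % BATCH == 0) {
                Thread.sleep(PAUSE_MS);   // let user reads/writes through
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] written = {0};
        checkpoint(20, pageNo -> written[0]++);
        System.out.println(written[0]);   // prints 20
    }
}
```

A real pacing policy would adapt the pause to observed user-thread latency rather than using a fixed sleep, but the shape is the same.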
