Message-ID: <4367A39E.2050507@debrunners.com>
Date: Tue, 01 Nov 2005 09:19:26 -0800
From: Daniel John Debrunner
To: derby-dev@db.apache.org
Subject: Re: Derby I/O issues during checkpointing

Øystein Grøvlen wrote:

> Some test runs we have done show very long transaction response times
> during checkpointing. This has been seen on several platforms. The
> load is TPC-B like transactions and the write cache is turned off, so
> the system is I/O bound. There seem to be two major issues:

Nice investigation. I think I have seen similar problems on Windows.

> 1. Derby does checkpointing by writing all dirty pages with
>    RandomAccessFile.write() and then doing a file sync when the entire
>    cache has been scanned. When the page cache is large, the file
>    system buffer will overflow during checkpointing, and occasionally
>    the writes will take very long. I have observed single write
>    operations that took almost 12 seconds. What is even worse is that
>    during this period read performance on other files can also be
>    very bad. For example, reading an index page from disk can take
>    close to 10 seconds while the base table is being checkpointed.
>    Hence, transactions are severely slowed down.
>
>    I have managed to improve response times by syncing each file on
>    every 100th write. Is this something we should consider including
>    in the code? Do you have better suggestions?

Sounds reasonable.
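For illustration, here is a minimal, untested sketch of what "sync on
every Nth write" during a checkpoint could look like. The class and
method names (CheckpointWriter, writeDirtyPages) are made up for this
sketch, not Derby's actual checkpoint code:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.List;

    class CheckpointWriter {
        private static final int SYNC_INTERVAL = 100; // sync every 100th write

        // Write each dirty page at its offset, syncing periodically so
        // the file system buffer never accumulates an entire page cache
        // worth of unflushed data.
        void writeDirtyPages(RandomAccessFile file, List<byte[]> pages,
                             long[] offsets, int pageSize) throws IOException {
            int writesSinceSync = 0;
            for (int i = 0; i < pages.size(); i++) {
                file.seek(offsets[i]);
                file.write(pages.get(i), 0, pageSize);
                if (++writesSinceSync >= SYNC_INTERVAL) {
                    file.getFD().sync(); // bound the amount of unflushed data
                    writesSinceSync = 0;
                }
            }
            file.getFD().sync(); // final sync at the end of the checkpoint
        }
    }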
> 2. What makes things even worse is that only a single thread can read
>    a page from a file at a time. (Note that Derby has one file per
>    table.) This is because the implementation of RAFContainer.readPage
>    is as follows:
>
>       synchronized (this) { // 'this' is a FileContainer, i.e. a file object
>           fileData.seek(pageOffset); // fileData is a RandomAccessFile
>           fileData.readFully(pageData, 0, pageSize);
>       }
>
>    During a checkpoint, when I/O is slow, this creates long queues of
>    readers. In my run with 20 clients, I observed read requests that
>    took more than 20 seconds.

Hmmm, I think that code was written assuming the call would not take
that long!

>    This behavior also limits throughput and partly explains why I get
>    low CPU utilization with 20 clients. All my TPC-B clients are
>    serialized since most will need 1-2 disk accesses (an index leaf
>    page and one page of the account table).
>
>    Generally, in order to let the OS optimize I/O, one should have
>    many outstanding I/O calls at a time. (See Frederiksen, Bonnet:
>    "Getting Priorities Straight: Improving Linux Support for Database
>    I/O", VLDB 2005.)
>
>    I have attached a patch where I have introduced several file
>    descriptors (RandomAccessFile objects) per RAFContainer. These are
>    used for reading. The principle is that when all readers are busy,
>    a readPage request will create a new reader. (There is a maximum
>    number of readers.) With this patch, throughput was improved by
>    50% on Linux. The combination of this patch and syncing on every
>    100th write reduced maximum transaction response times by 90%.

My only concern would be the number of open file descriptors, as others
have pointed out. You might want to scavenge open descriptors from
containers that are no longer heavily used.

> The patch is not ready for inclusion into Derby, but I would like to
> hear whether you think this is a viable approach.

These changes seem low risk and enable worthwhile performance gains
without completely changing the I/O system. They could then provide the
performance baseline that a full async rewrite would have to beat (or
at least match).

Dan.
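P.S. Purely for concreteness, a rough, untested sketch of the
multiple-readers idea. The names here (PooledPageReader, acquire,
release) are illustrative, and this ignores RAFContainer's real
structure:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayDeque;
    import java.util.Deque;

    class PooledPageReader {
        private final File containerFile;
        private final int maxReaders;
        private final Deque<RandomAccessFile> idle =
                new ArrayDeque<RandomAccessFile>();
        private int open; // descriptors created so far

        PooledPageReader(File containerFile, int maxReaders) {
            this.containerFile = containerFile;
            this.maxReaders = maxReaders;
        }

        // Concurrent reads proceed in parallel, each on its own
        // descriptor, instead of serializing on a single RandomAccessFile.
        void readPage(long pageOffset, byte[] pageData, int pageSize)
                throws IOException, InterruptedException {
            RandomAccessFile r = acquire();
            try {
                r.seek(pageOffset);
                r.readFully(pageData, 0, pageSize);
            } finally {
                release(r);
            }
        }

        private synchronized RandomAccessFile acquire()
                throws IOException, InterruptedException {
            while (idle.isEmpty()) {
                if (open < maxReaders) { // grow the pool on demand
                    open++;
                    return new RandomAccessFile(containerFile, "r");
                }
                wait(); // all readers busy; wait for a release
            }
            return idle.pop();
        }

        private synchronized void release(RandomAccessFile r) {
            idle.push(r);
            notify(); // wake one waiting reader
        }
    }

    // Note: descriptors are never closed here; a real version would want
    // to scavenge idle descriptors, per the concern above.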