From: Oystein.Grovlen@Sun.COM (Øystein Grøvlen)
Date: Mon, 31 Oct 2005 22:52:01 +0100
Subject: Derby I/O issues during checkpointing
To: derby-dev@db.apache.org

Some test runs we have done show very long transaction response times
during checkpointing. This has been seen on several platforms. The
load is TPC-B-like transactions, and the write cache is turned off, so
the system is I/O bound. There seem to be two major issues:

1. Derby does checkpointing by writing all dirty pages with
   RandomAccessFile.write() and then doing a file sync once the entire
   cache has been scanned. When the page cache is large, the file
   system buffer will overflow during checkpointing, and occasionally
   the writes will take very long. I have observed single write
   operations that took almost 12 seconds. What is even worse is that
   during this period read performance on other files can also be very
   bad. For example, reading an index page from disk can take close to
   10 seconds while the base table is being checkpointed. Hence,
   transactions are severely slowed down.

   I have managed to improve response times by flushing each file to
   disk on every 100th write. Is this something we should consider
   including in the code? Do you have better suggestions?

2. What makes things even worse is that only a single thread can read
   a page from a file at a time. (Note that Derby has one file per
   table.) This is because the implementation of
   RAFContainer.readPage is as follows:

       synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
           fileData.seek(pageOffset);  // fileData is a RandomAccessFile
           fileData.readFully(pageData, 0, pageSize);
       }

   During a checkpoint, when I/O is slow, this creates long queues of
   readers. In my run with 20 clients, I observed read requests that
   took more than 20 seconds.

   This behavior also limits throughput and can partly explain why I
   get low CPU utilization with 20 clients. All my TPC-B clients are
   serialized, since most of them need 1-2 disk accesses per
   transaction (an index leaf page and one page of the account table).

   Generally, in order to let the OS optimize I/O, one should have
   many outstanding I/O calls at a time. (See Frederiksen, Bonnet:
   "Getting Priorities Straight: Improving Linux Support for Database
   I/O", VLDB 2005.)

   I have attached a patch where I have introduced several file
   descriptors (RandomAccessFile objects) per RAFContainer. These are
   used for reading. The principle is that when all readers are busy,
   a readPage request creates a new reader, up to a maximum number of
   readers. With this patch, throughput improved by 50% on Linux. The
   combination of this patch and syncing on every 100th write reduced
   maximum transaction response times by 90%.

The patch is not ready for inclusion into Derby, but I would like to
hear whether you think this is a viable approach.
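
To make the "sync on every 100th write" idea in item 1 concrete, here is a minimal sketch. It is not Derby code: the class PeriodicSyncWriter, its methods, and the syncInterval parameter are hypothetical names made up for illustration, and it uses a plain java.io.RandomAccessFile rather than Derby's StorageRandomAccessFile. It assumes that FileDescriptor.sync() is an acceptable way to force the data out.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Illustrative only: writes pages and forces them to disk every
 * syncInterval writes, so that the file system buffer never has to
 * absorb a whole checkpoint before the first sync.
 */
class PeriodicSyncWriter implements AutoCloseable {
    private final RandomAccessFile file;
    private final int pageSize;
    private final int syncInterval;   // e.g. 100, as in the experiment above
    private int writesSinceSync = 0;
    private int syncCount = 0;

    PeriodicSyncWriter(File path, int pageSize, int syncInterval) throws IOException {
        if (pageSize <= 0 || syncInterval <= 0) {
            throw new IllegalArgumentException("pageSize and syncInterval must be positive");
        }
        this.file = new RandomAccessFile(path, "rw");
        this.pageSize = pageSize;
        this.syncInterval = syncInterval;
    }

    /** Writes one page and syncs once syncInterval writes have accumulated. */
    synchronized void writePage(long pageNumber, byte[] pageData) throws IOException {
        if (pageData.length != pageSize) {
            throw new IllegalArgumentException("pageData must be exactly one page");
        }
        file.seek(pageNumber * pageSize);
        file.write(pageData, 0, pageSize);
        if (++writesSinceSync >= syncInterval) {
            sync();
        }
    }

    /** Final sync at the end of a checkpoint; does nothing if no writes are pending. */
    synchronized void finishCheckpoint() throws IOException {
        if (writesSinceSync > 0) {
            sync();
        }
    }

    private void sync() throws IOException {
        file.getFD().sync();   // force buffered data for this file to the device
        writesSinceSync = 0;
        syncCount++;
    }

    /** Number of syncs issued so far (exposed only to make the sketch observable). */
    synchronized int syncCount() {
        return syncCount;
    }

    @Override
    public synchronized void close() throws IOException {
        file.close();
    }
}
```

With an interval of 100, a checkpoint of 250 pages would issue two syncs while writing and a third from finishCheckpoint(). The right interval is presumably workload and platform dependent; 100 is simply the value used in the runs described above.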
--
Øystein

Index: java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java
===================================================================
--- java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java (revision 312819)
+++ java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java (working copy)
@@ -45,7 +45,8 @@
 import org.apache.derby.io.StorageFile;
 import org.apache.derby.io.StorageRandomAccessFile;
 
-import java.util.Vector;
+import java.util.ArrayList;
+import java.util.List;
 
 import java.io.DataInput;
 import java.io.IOException;
@@ -66,12 +67,15 @@
      * Immutable fields
      */
     protected StorageRandomAccessFile fileData;
-
+
     /*
     ** Mutable fields, only valid when the identity is valid.
     */
     protected boolean needsSync;
 
+    private int openReaders;
+    private List freeReaders;
+
     /* privileged actions */
     private int actionCode;
     private static final int GET_FILE_NAME_ACTION = 1;
@@ -79,6 +83,7 @@
     private static final int REMOVE_FILE_ACTION = 3;
     private static final int OPEN_CONTAINER_ACTION = 4;
     private static final int STUBBIFY_ACTION = 5;
+    private static final int OPEN_READONLY_ACTION = 6;
     private ContainerKey actionIdentity;
     private boolean actionStub;
     private boolean actionErrorOK;
@@ -86,12 +91,15 @@
     private StorageFile actionFile;
     private LogInstant actionInstant;
 
+
     /*
      * Constructors
      */
 
     RAFContainer(BaseDataFileFactory factory) {
         super(factory);
+        openReaders = 0;
+        freeReaders = new ArrayList();
     }
 
     /*
@@ -193,12 +201,25 @@
 
         long pageOffset = pageNumber * pageSize;
 
-        synchronized (this) {
+
+        StorageRandomAccessFile reader = null;
+        for (;;) {
+            synchronized(freeReaders) {
+                if (freeReaders.size() > 0) {
+                    reader = (StorageRandomAccessFile)freeReaders.remove(0);
+                    break;
+                }
+            }
+            openNewReader();
+        }
 
-            fileData.seek(pageOffset);
 
-            fileData.readFully(pageData, 0, pageSize);
-        }
+        reader.seek(pageOffset);
+        reader.readFully(pageData, 0, pageSize);
+        synchronized(freeReaders) {
+            freeReaders.add(reader);
+            freeReaders.notify();
+        }
 
         if (dataFactory.databaseEncrypted() &&
             pageNumber != FIRST_ALLOC_PAGE_NUMBER)
@@ -769,6 +790,21 @@
         finally{ actionIdentity = null; }
     }
 
+
+    synchronized boolean openNewReader()
+        throws StandardException
+    {
+        actionCode = OPEN_READONLY_ACTION;
+        actionIdentity = (ContainerKey)getIdentity();
+        try
+        {
+            return AccessController.doPrivileged( this) != null;
+        }
+        catch( PrivilegedActionException pae){ throw (StandardException) pae.getException();}
+        finally{ actionIdentity = null; }
+    }
+
+
     private synchronized void stubbify(LogInstant instant)
         throws StandardException
     {
@@ -1112,6 +1148,52 @@
             dataFactory.stubFileToRemoveAfterCheckPoint(stub,actionInstant, getIdentity());
             return null;
         } // end of case STUBBIFY_ACTION
+        case OPEN_READONLY_ACTION:
+        {
+            try {
+                synchronized(freeReaders) {
+                    if (openReaders > 20) {
+                        freeReaders.wait();
+                        return null;
+                    } else {
+                        ++openReaders;
+                    }
+                }
+            } catch (InterruptedException ie) {
+                throw StandardException.newException(
+                    SQLState.DATA_UNEXPECTED_EXCEPTION, ie);
+            }
+
+            StorageFile file = privGetFileName(actionIdentity, false, true, true);
+            if (file == null)
+                return null;
+
+            try {
+                if (!file.exists()) {
+                    return null;
+                }
+            } catch (SecurityException se) {
+                throw StandardException.newException(
+                    SQLState.DATA_UNEXPECTED_EXCEPTION, se);
+            }
+
+            try {
+
+                StorageRandomAccessFile reader = file.getRandomAccessFile("r");
+                synchronized(freeReaders) {
+                    freeReaders.add(reader);
+                }
+//              SanityManager.DEBUG_PRINT("RAFContainer", "Opens reader no. " + openReaders);
+
+            } catch (IOException ioe) {
+                throw dataFactory.markCorrupt(
+                    StandardException.newException(
+                        SQLState.FILE_CONTAINER_EXCEPTION, ioe, this));
+            }
+
+            return this;
+        } // end of case OPEN_CONTAINER_ACTION
+
         }
         return null;
     } // end of run
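
For comparison, the same "several read descriptors per file, created on demand up to a maximum" principle can be sketched outside Derby with java.util.concurrent. This is an illustration under assumptions, not a proposed replacement for the patch: it assumes a Java 5 or later runtime, uses plain java.io.RandomAccessFile instead of StorageRandomAccessFile, and the names ReaderPool, maxReaders and openedCount are hypothetical. Unlike the patch, it hands the descriptor back in a finally block, so a failed read does not leak a reader or a slot.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: a bounded pool of read-only descriptors for one file. */
class ReaderPool implements AutoCloseable {
    private final File path;
    private final int pageSize;
    private final Semaphore permits;   // one permit per allowed concurrent reader
    private final ConcurrentLinkedQueue<RandomAccessFile> idle =
            new ConcurrentLinkedQueue<RandomAccessFile>();
    private final AtomicInteger opened = new AtomicInteger();

    ReaderPool(File path, int pageSize, int maxReaders) {
        this.path = path;
        this.pageSize = pageSize;
        this.permits = new Semaphore(maxReaders);
    }

    /** Reads one page; blocks only while maxReaders reads are already in progress. */
    void readPage(long pageNumber, byte[] pageData)
            throws IOException, InterruptedException {
        permits.acquire();
        RandomAccessFile reader = idle.poll();   // reuse an idle descriptor if any
        boolean ok = false;
        try {
            if (reader == null) {
                // All existing descriptors are in use by other permit holders,
                // so opening one more keeps the total at or below maxReaders.
                reader = new RandomAccessFile(path, "r");
                opened.incrementAndGet();
            }
            reader.seek(pageNumber * pageSize);
            reader.readFully(pageData, 0, pageSize);
            ok = true;
        } finally {
            try {
                if (reader != null) {
                    if (ok) {
                        idle.add(reader);    // hand the descriptor back for reuse
                    } else {
                        reader.close();      // do not reuse a descriptor after a failure
                    }
                }
            } finally {
                permits.release();
            }
        }
    }

    /** Number of descriptors opened so far (exposed only to make the sketch observable). */
    int openedCount() {
        return opened.get();
    }

    /** Closes idle descriptors; callers must ensure no reads are in progress. */
    @Override
    public void close() throws IOException {
        RandomAccessFile r;
        while ((r = idle.poll()) != null) {
            r.close();
        }
    }
}
```

The semaphore takes the place of the wait()/notify() pair on freeReaders, and no container-wide monitor is held while a reader is being opened or while a thread waits for a free slot.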