db-derby-dev mailing list archives

From Oystein.Grov...@Sun.COM (Øystein Grøvlen)
Subject Derby I/O issues during checkpointing
Date Mon, 31 Oct 2005 21:52:01 GMT

Some test runs we have done show very long transaction response times
during checkpointing.  This has been seen on several platforms.  The
load is TPC-B-like transactions, and the write cache is turned off so
the system is I/O bound.  There seem to be two major issues:

1. Derby does checkpointing by writing all dirty pages with
   RandomAccessFile.write() and then doing a file sync once the entire
   cache has been scanned.  When the page cache is large, the file
   system buffer will overflow during checkpointing, and occasionally
   the writes will take very long.  I have observed single write
   operations that took almost 12 seconds.  Even worse, read
   performance on other files can also be very bad during this
   period.  For example, reading an index page from disk can take close
   to 10 seconds when the base table is checkpointed.  Hence,
   transactions are severely slowed down.

   I have managed to improve response times by syncing each file on
   every 100th write.  Is this something we should consider including
   in the code?  Do you have better suggestions?
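
   To make the idea concrete, here is a minimal stand-alone sketch of
   syncing on every Nth write.  It is not Derby code: the class name
   PeriodicSyncWriter, its fields, and the syncInterval parameter are
   hypothetical and only illustrate the principle using plain
   java.io.RandomAccessFile.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Hypothetical helper (not Derby code): writes pages through a
 * RandomAccessFile and forces them to disk after every N writes,
 * so that dirty data cannot pile up in the file system buffer.
 */
class PeriodicSyncWriter {
    private final RandomAccessFile file;
    private final int pageSize;
    private final int syncInterval;   // e.g. 100, as in the experiment above
    private int writesSinceSync = 0;
    private int syncCount = 0;        // kept only so the behavior can be observed

    PeriodicSyncWriter(RandomAccessFile file, int pageSize, int syncInterval) {
        this.file = file;
        this.pageSize = pageSize;
        this.syncInterval = syncInterval;
    }

    /** Writes one page and syncs once syncInterval writes have accumulated. */
    synchronized void writePage(long pageNumber, byte[] pageData) throws IOException {
        file.seek(pageNumber * pageSize);
        file.write(pageData, 0, pageSize);
        if (++writesSinceSync >= syncInterval) {
            sync();
        }
    }

    /** Forces outstanding writes to the device and resets the write counter. */
    synchronized void sync() throws IOException {
        file.getFD().sync();
        writesSinceSync = 0;
        syncCount++;
    }

    synchronized int getSyncCount() {
        return syncCount;
    }
}
```

   A checkpoint would call writePage() for each dirty page and sync()
   once more at the end, so the final partial batch is also forced to
   disk.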

2. What makes things even worse is that only a single thread can read a
   page from a file at a time.  (Note that Derby has one file per
   table.)  This is because the implementation of RAFContainer.readPage
   is as follows:

        synchronized (this) {  // 'this' is a FileContainer, i.e. a file object
            fileData.seek(pageOffset);  // fileData is a RandomAccessFile
            fileData.readFully(pageData, 0, pageSize);
        }

   During a checkpoint, when I/O is slow, this creates long queues of
   readers.  In my run with 20 clients, I observed read requests that
   took more than 20 seconds.

   This behavior also limits throughput and may partly explain
   why I get low CPU utilization with 20 clients.  All my TPC-B
   clients are serialized since most will need 1-2 disk accesses
   (an index leaf page and one page of the account table).

   Generally, in order to make the OS able to optimize I/O, one should
   have many outstanding I/O calls at a time.  (See Frederiksen,
   Bonnet: "Getting Priorities Straight: Improving Linux Support for
   Database I/O", VLDB 2005).  

   I have attached a patch where I have introduced several file
   descriptors (RandomAccessFile objects) per RAFContainer.  These are
   used for reading.  The principle is that when all readers are busy,
   a readPage request will create a new reader.  (There is a maximum
   number of readers.)  With this patch, throughput improved by 50%
   on Linux.  Combining this patch with syncing on every 100th write
   reduced maximum transaction response times by 90%.

   The patch is not ready for inclusion in Derby, but I would like
   to hear whether you think this is a viable approach.
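
   The reader-pool principle described above (reuse a free reader,
   open a new one when all are busy, wait once the maximum is reached)
   can be sketched in isolation as follows.  This is not the attached
   patch: the class name ReaderPool and the maxReaders parameter are
   hypothetical, it uses plain java.io.RandomAccessFile instead of
   Derby's StorageRandomAccessFile, and it waits in a loop to guard
   against spurious wakeups.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical stand-alone illustration (not the attached patch):
 * a bounded pool of read-only descriptors for a single file.
 */
class ReaderPool {
    private final File file;
    private final int maxReaders;
    private final Deque<RandomAccessFile> freeReaders = new ArrayDeque<>();
    private int openReaders = 0;

    ReaderPool(File file, int maxReaders) {
        this.file = file;
        this.maxReaders = maxReaders;
    }

    /** Reads one page; the I/O itself runs outside the pool lock. */
    void readPage(long pageNumber, byte[] pageData, int pageSize)
            throws IOException, InterruptedException {
        RandomAccessFile reader = acquire();
        try {
            reader.seek(pageNumber * pageSize);
            reader.readFully(pageData, 0, pageSize);
        } finally {
            release(reader);
        }
    }

    private synchronized RandomAccessFile acquire()
            throws IOException, InterruptedException {
        // Wait only when no reader is free and the limit has been reached.
        while (freeReaders.isEmpty() && openReaders >= maxReaders) {
            wait();
        }
        if (!freeReaders.isEmpty()) {
            return freeReaders.pop();
        }
        RandomAccessFile reader = new RandomAccessFile(file, "r");
        openReaders++;   // counted only after the open succeeded
        return reader;
    }

    private synchronized void release(RandomAccessFile reader) {
        freeReaders.push(reader);
        notify();        // wake one thread waiting in acquire()
    }

    synchronized int getOpenReaders() {
        return openReaders;
    }

    /** Closes all idle readers; call only when no reads are in flight. */
    synchronized void close() throws IOException {
        for (RandomAccessFile reader : freeReaders) {
            reader.close();
        }
        freeReaders.clear();
        openReaders = 0;
    }
}
```

   Only acquire() and release() take the lock, so up to maxReaders
   threads can have reads outstanding on the same file at once, which
   is what lets the OS reorder and overlap the requests.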

-- 
Øystein

Index: java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java
===================================================================
--- java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java   (revision 312819)
+++ java/engine/org/apache/derby/impl/store/raw/data/RAFContainer.java (working copy)
@@ -45,7 +45,8 @@
 import org.apache.derby.io.StorageFile;
 import org.apache.derby.io.StorageRandomAccessFile;
 
-import java.util.Vector;
+import java.util.ArrayList;
+import java.util.List;
 
 import java.io.DataInput;
 import java.io.IOException;
@@ -66,12 +67,15 @@
       * Immutable fields
      */
     protected StorageRandomAccessFile fileData;
-
+        
   /* 
     ** Mutable fields, only valid when the identity is valid.
       */
      protected boolean                       needsSync;
 
+    private int openReaders;
+    private List freeReaders;
+
     /* privileged actions */
     private int actionCode;
     private static final int GET_FILE_NAME_ACTION = 1;
@@ -79,6 +83,7 @@
     private static final int REMOVE_FILE_ACTION = 3;
     private static final int OPEN_CONTAINER_ACTION = 4;
     private static final int STUBBIFY_ACTION = 5;
+    private static final int OPEN_READONLY_ACTION = 6;
     private ContainerKey actionIdentity;
     private boolean actionStub;
     private boolean actionErrorOK;
@@ -86,12 +91,15 @@
     private StorageFile actionFile;
     private LogInstant actionInstant;
     
+
   /*
       * Constructors
          */
 
    RAFContainer(BaseDataFileFactory factory) {
             super(factory);
+        openReaders = 0;
+        freeReaders = new ArrayList();
         }
 
      /*
@@ -193,12 +201,25 @@
 
                long pageOffset = pageNumber * pageSize;
 
-              synchronized (this) {
+        
+        StorageRandomAccessFile reader = null;
+        for (;;) {
+            synchronized(freeReaders) {
+                if (freeReaders.size() > 0) {
+                    reader = (StorageRandomAccessFile)freeReaders.remove(0);
+                    break;
+                }
+            }
+            openNewReader();
+        } 
 
-                        fileData.seek(pageOffset);
 
-                    fileData.readFully(pageData, 0, pageSize);
-             }
+        reader.seek(pageOffset);
+        reader.readFully(pageData, 0, pageSize);
+        synchronized(freeReaders) {
+            freeReaders.add(reader);
+            freeReaders.notify();
+        }
 
               if (dataFactory.databaseEncrypted() &&
                  pageNumber != FIRST_ALLOC_PAGE_NUMBER)
@@ -769,6 +790,21 @@
         finally{ actionIdentity = null; }
     }
 
+
+   synchronized boolean openNewReader()
+        throws StandardException
+    {
+        actionCode = OPEN_READONLY_ACTION;
+         actionIdentity = (ContainerKey)getIdentity();
+        try
+        {
+            return AccessController.doPrivileged( this) != null;
+        }
+        catch( PrivilegedActionException pae){ throw (StandardException) pae.getException();}
+        finally{ actionIdentity = null; }
+    }
+
+
  private synchronized void stubbify(LogInstant instant)
         throws StandardException
         {
@@ -1112,6 +1148,52 @@
              dataFactory.stubFileToRemoveAfterCheckPoint(stub,actionInstant, getIdentity());
              return null;
          } // end of case STUBBIFY_ACTION
+         case OPEN_READONLY_ACTION:
+         {
+             try {
+                 synchronized(freeReaders) {
+                     if (openReaders > 20) {
+                         freeReaders.wait();
+                         return null;
+                     } else {
+                         ++openReaders;
+                     }
+                 }
+             } catch (InterruptedException ie) {
+                 throw StandardException.newException(
+                     SQLState.DATA_UNEXPECTED_EXCEPTION, ie);
+             }
+             
+             StorageFile file = privGetFileName(actionIdentity, false, true, true);
+             if (file == null)
+                 return null;
+
+             try {
+                 if (!file.exists()) {
+                     return null;
+                 }
+             } catch (SecurityException se) {
+                 throw StandardException.newException(
+                     SQLState.DATA_UNEXPECTED_EXCEPTION, se);
+             }
+
+             try {
+
+                 StorageRandomAccessFile reader = file.getRandomAccessFile("r");
+                 synchronized(freeReaders) {
+                     freeReaders.add(reader);
+                 }
+//                 SanityManager.DEBUG_PRINT("RAFContainer", "Opens reader no. " + openReaders);
+
+             } catch (IOException ioe) {
+                 throw dataFactory.markCorrupt(
+                     StandardException.newException(
+                         SQLState.FILE_CONTAINER_EXCEPTION, ioe, this));
+             }
+
+             return this;
+         } // end of case OPEN_READONLY_ACTION
+
          }
          return null;
      } // end of run

