Date: Fri, 10 Jun 2005 10:32:02 -0700
From: Mike Matrigali
To: Derby Development <derby-dev@db.apache.org>
Subject: Re: [PATCH] DERBY-302 Insert on clob of 500k of data takes long time

All the code looks right to me, but I have some questions that may be
nit performance issues.

1) Should most of the new routines you added be "final" - except for
the one in SQLChar, which gets overridden?

2) There are a lot of literal numbers rather than named constants. I
realize you are following what was already there, but it makes the code
a little hard to read, and if any of the numbers relate to each other
it is hard to tell.

3) Sort of part of 2 - should the grow-by number relate to the number
that decides when to return a buffer back to the VM rather than cache
it in the datatype? That code exists because, on a select of 1 million
rows, derby will only allocate one SQLChar per CHAR column - not 1
million. The case I am wondering about is a VARCHAR bigger than 64
bytes but less than 4k. Does growing by 4k mean the varchar buffer will
start at 64 and then go to 64+4k? If so, it seems like we should cache
the buffer at 64+4k, not 4k.

Sunitha Kambhampati wrote:
> DERBY-302 - Insert on a clob of 500k of data using streams takes a
> long time: 3 minutes on a sun jvm 1.4.2 and 20 seconds on an ibm jvm
> 1.4.2. The following fix improves the performance of an insert into a
> 500k clob from 20 seconds to around 1 second. Note that by changing
> just the test program, the time came down from 3 minutes to an average
> of around 20 seconds.
>
> Currently in derby, for an insert on a clob using setCharacterStream,
> the entire stream is materialized into a char array and sent to store
> for the insert. (We should not have to stream here; I will file
> another jira issue for this and put in all the information I learnt.)
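(Just to make sure we are talking about the same code path: I assume the
insert in question looks roughly like the sketch below. The table name,
file name and length are made up for illustration - they are not from
the jira test program.)

    // Hypothetical insert that drives the path described above: the
    // whole 500k character stream ends up materialized into a char
    // array inside SQLChar.readExternal before it is sent to store.
    PreparedStatement ps = conn.prepareStatement(
            "insert into clobtable (id, c) values (?, ?)");
    ps.setInt(1, 1);
    ps.setCharacterStream(2, new FileReader("clob500k.txt"), 500 * 1024);
    ps.executeUpdate();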
>
> Given this is how inserts for large clobs currently happen, the
> analysis of the performance issue is as follows:
>
> -- A profiler run shows that most of the time is spent in
> SQLChar.readExternal, which is where the user's input stream is
> materialized into a char array. The array is grown gradually until the
> entire stream has been materialized. The code snippet below shows by
> how much the array is grown each time readExternal realizes it has to
> read more bytes from the stream.
>
> The dvd hierarchy for clob is: SQLClob (a dvd) extends SQLVarchar
> extends SQLChar.
>
> So, in SQLChar.readExternal:
>
>     ........
>     int growby = in.available();
>     if (growby < 64)
>         growby = 64;
>
> and then an allocation and an arraycopy into the newly allocated array.
>
> -- In the code snippet, 'in' is the wrapper around the user's stream,
> which is ReaderToUTF8Stream. ReaderToUTF8Stream extends InputStream and
> does not override the available() method, and as per the spec the
> default InputStream.available() returns 0.
>
> -- Thus the array grows by only 64 bytes each time, which is obviously
> not performant. For a 500k clob insert this means the allocation and
> arraycopy steps happen ~8000 times.
>
> -- ReaderToUTF8Stream, which wraps the user's stream, reads from the
> stream, does the utf8 conversion and puts the result into a 4k array.
> I think it is reasonable to use a 32k buffer to store this information
> for clobs.
>
> Although I think there are more possible optimizations in this area, I
> prefer the incremental approach too :) so this patch is a first step
> towards fixing insert clob performance in the current system.
>
> Fix includes:
> -- Enhanced the way the array is grown: keep the original 64 bytes for
> char (which seems reasonable given the upper limit for char) but
> override it to 4k for varchar and clob.
> -- Override available() in ReaderToUTF8Stream to return a better
> estimate of how many bytes can be read.
>
> svn stat
> M java\engine\org\apache\derby\impl\jdbc\ReaderToUTF8Stream.java
> M java\engine\org\apache\derby\iapi\services\io\LimitReader.java
> M java\engine\org\apache\derby\iapi\types\SQLChar.java
> M java\engine\org\apache\derby\iapi\types\SQLVarchar.java
>
> -- Ran derbyall OK with the sun jvm.
>
> -- I could add a test that compares times, but I realize that is
> probably not the best solution here; it should ideally be part of a
> performance regression suite.
>
> Numbers for clob inserts with the fix, in seconds for one insert on my
> laptop (windows, 1G ram, 1.6GHz Intel Pentium(M)), as per the jira
> issue:
>
> FileSize    ibm jvm 1.4.2    sun jvm 1.4.2    sun jvm 1.5
> 500k        0.9s             1.6s             1.7s
> 1M          2.1s             4s               5s
> 2M          3s               9s               11s
> 4M          7s               18s              22s
>
> Without the fix, 500k with the sun jvm takes 3 minutes and with the
> ibm jvm takes 20 seconds. I will add the test program along with the
> input files to the jira issue.
> _________________________
> Without this fix: as I already mentioned in the jira comment for
> DERBY-302, changing the program attached to the jira entry to use a
> BufferedReader with the buffer size set to a bigger value than the
> default (to 64k) brought the sun jvm times down close to the ibm jvm
> times.
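(For what it is worth, I read that test-program change as roughly the
following sketch - the variable names and file name are made up, not
taken from the actual jira attachment:)

    // Wrap the user's Reader in a BufferedReader with a 64k buffer so
    // the underlying file is read in large chunks before being handed
    // to setCharacterStream.
    Reader reader = new BufferedReader(new FileReader("clob500k.txt"),
            64 * 1024);
    ps.setCharacterStream(2, reader, 500 * 1024);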
> I noticed in my test that if I ran the test multiple times and did
> multiple inserts, the performance of the sun jvm and the ibm jvm for a
> 500k clob was around 20 seconds - I guess the jit kicks in, plus the
> OS cache may also be a factor.
> ________________________
> Please review it, and if there are no comments, can a committer please
> commit it.
>
> Thanks,
> Sunitha.
>
>
> ------------------------------------------------------------------------
>
> Index: java/engine/org/apache/derby/impl/jdbc/ReaderToUTF8Stream.java
> ===================================================================
> --- java/engine/org/apache/derby/impl/jdbc/ReaderToUTF8Stream.java (revision 189773)
> +++ java/engine/org/apache/derby/impl/jdbc/ReaderToUTF8Stream.java (working copy)
> @@ -47,7 +47,7 @@
>      ReaderToUTF8Stream(LimitReader reader)
>      {
>          this.reader = reader;
> -        buffer = new byte[4096];
> +        buffer = new byte[32768]; // seems reasonable to do 32k reads
>          blen = -1;
>      }
> 
> @@ -196,5 +196,19 @@
>          reader.close();
>      }
> 
> +    /**
> +     * Return an optimized estimate of the bytes available to read
> +     * from the stream.
> +     */
> +    public int available()
> +    {
> +        int remainingBytes = reader.getLimit();
> +        // This object buffers (32k - 6) bytes that can be read,
> +        // and when that is finished it reads the next available
> +        // bytes from the reader object.
> +        // reader.getLimit() returns the remaining bytes available
> +        // on this stream.
> +        return (32768 > remainingBytes ? remainingBytes : 32768);
> +    }
>  }
> 
> Index: java/engine/org/apache/derby/iapi/services/io/LimitReader.java
> ===================================================================
> --- java/engine/org/apache/derby/iapi/services/io/LimitReader.java (revision 189773)
> +++ java/engine/org/apache/derby/iapi/services/io/LimitReader.java (working copy)
> @@ -119,6 +119,16 @@
>          limitInPlace = true;
>          return;
>      }
> +
> +    /**
> +     * Return the limit of the stream that can be read without
> +     * throwing an EOFException.
> +     * @return the remaining bytes left to be read from the stream
> +     */
> +    public int getLimit()
> +    {
> +        return remainingBytes;
> +    }
> 
>      /**
>          Clear any limit set by setLimit. After this call no limit checking
> Index: java/engine/org/apache/derby/iapi/types/SQLChar.java
> ===================================================================
> --- java/engine/org/apache/derby/iapi/types/SQLChar.java (revision 189773)
> +++ java/engine/org/apache/derby/iapi/types/SQLChar.java (working copy)
> @@ -641,6 +641,11 @@
>          int utflen = in.readUnsignedShort();
> 
>          int requiredLength;
> +        // Minimum amount by which it is reasonable to grow the array
> +        // when we know it needs to grow by at least one byte: we
> +        // don't want to grow by just one byte, as that is not
> +        // performant.
> +        int minGrowBy = growBy();
>          if (utflen != 0)
>          {
>              // the object was not stored as a streaming column
> @@ -654,8 +659,8 @@
>              // OR
>              // The original string was a 0 length string.
>              requiredLength = in.available();
> -            if (requiredLength < 64)
> -                requiredLength = 64;
> +            if (requiredLength < minGrowBy)
> +                requiredLength = minGrowBy;
>          }
> 
>          char str[];
> @@ -707,12 +712,26 @@
>          if (strlen >= arrayLength) // the char array needs to be grown
>          {
>              int growby = in.available();
> -
> -            // We know at the array needs to be grown by at least one.
> +            // We know that the array needs to be grown by at least one.
>              // However, even if the input stream wants to block on every
>              // byte, we don't want to grow by a byte at a time.
> -            if (growby < 64)
> -                growby = 64;
> +            // Note: for large data (clob > 32k) it is performant to
> +            // grow the array by at least 4k rather than by a small
> +            // amount. Growing by 32k might be even better, but that
> +            // may be a little excessive(?) for small data.
> +            // Hopefully in.available() will give a fair estimate of
> +            // how much data can be read, so that the array grows in
> +            // larger, necessary chunks.
> +            // This performance issue due to the slow growth of this
> +            // array was noticed because inserts on clobs were taking
> +            // a really long time: the array here previously grew by
> +            // 64 bytes each time until the stream was drained.
> +            // (DERBY-302)
> +            // For char, growing by 64 seems reasonable, but for
> +            // varchar and clob 4k or 32k is performant, and hence
> +            // growBy() is overridden to ensure this.
> +            if (growby < minGrowBy)
> +                growby = minGrowBy;
> 
>              int newstrlength = arrayLength + growby;
>              char oldstr[] = str;
> @@ -804,6 +823,18 @@
>          cKey = null;
>      }
> 
> +    /**
> +     * Returns the reasonable minimum amount by which the array can
> +     * grow. See readExternal. When we know that the array needs to
> +     * grow by at least one byte, it is not performant to grow by
> +     * just one byte; instead, this amount is used to provide a
> +     * reasonable grow-by size.
> +     * @return minimum reasonable grow-by size
> +     */
> +    protected int growBy()
> +    {
> +        return 64; // seems reasonable for a char
> +    }
>      /**
>       * @see Storable#restoreToNull
>       *
> Index: java/engine/org/apache/derby/iapi/types/SQLVarchar.java
> ===================================================================
> --- java/engine/org/apache/derby/iapi/types/SQLVarchar.java (revision 189773)
> +++ java/engine/org/apache/derby/iapi/types/SQLVarchar.java (working copy)
> @@ -182,4 +182,17 @@
>      {
>          return TypeId.VARCHAR_PRECEDENCE;
>      }
> +
> +    /**
> +     * Returns the reasonable minimum amount by which the array can
> +     * grow. See readExternal. When we know that the array needs to
> +     * grow by at least one byte, it is not performant to grow by
> +     * just one byte; instead, this amount is used to provide a
> +     * reasonable grow-by size.
> +     * @return minimum reasonable grow-by size
> +     */
> +    protected int growBy()
> +    {
> +        return 4096; // seems reasonable for a varchar or clob
> +    }
>  }
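To put some rough numbers behind why a 64-byte grow-by hurts so much
once available() returns 0, here is a small standalone sketch (not
derby code - it models only the reallocation/arraycopy cost for a 500k
value, ignoring the utf8 decoding). With a 64-byte increment it comes
out to the ~8000 allocations mentioned above; with a 4k increment it is
around 125.

    // Standalone sketch: count the reallocations and total chars
    // copied when a buffer for a 500k value is grown by a fixed
    // increment, the way SQLChar.readExternal grows its char array
    // when in.available() keeps returning 0.
    public class GrowthCost
    {
        static void cost(int total, int growBy)
        {
            int capacity = growBy;      // first allocation
            int allocations = 1;
            long copied = 0;
            while (capacity < total)
            {
                copied += capacity;     // arraycopy of the old contents
                capacity += growBy;
                allocations++;
            }
            System.out.println("growBy=" + growBy + ": " + allocations
                    + " allocations, " + copied + " chars copied");
        }

        public static void main(String[] args)
        {
            int total = 500 * 1024;     // roughly the 500k case
            cost(total, 64);            // old behaviour when available() == 0
            cost(total, 4096);          // with growBy() overridden for varchar/clob
        }
    }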