db-derby-dev mailing list archives

From Sunitha Kambhampati <ksunitha...@gmail.com>
Subject [PATCH] DERBY-302 Insert on clob of 500k of data takes long time
Date Fri, 10 Jun 2005 16:48:55 GMT
DERBY-302 - Insert on a clob of 500k of data using streams takes a long
time: 3 minutes on the Sun JVM 1.4.2 and 20 seconds with the IBM JVM 1.4.2.
The following fix improves the performance of an insert into a 500k clob
from 20 seconds to around 1 second.  Note that changing just the test
program already reduced the time from 3 minutes to an average of around 20 seconds.

Currently in Derby, for an insert on a clob using setCharacterStream,
the entire stream is materialized into a char array and sent to store
for the insert.  (We should not have to materialize the whole stream
here; I will file another JIRA issue for this and include all the
information I have learned.)
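
For context, the kind of insert being measured looks roughly like this; the
connection URL, table name CLOBTAB, and file name are illustrative, not taken
from the actual test program:

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class ClobInsertSketch {
        public static void main(String[] args) throws Exception {
            // Embedded Derby (JDK 1.4 era, so load the driver explicitly).
            Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
            Connection conn = DriverManager.getConnection("jdbc:derby:testdb;create=true");

            Statement s = conn.createStatement();
            s.execute("CREATE TABLE CLOBTAB (C CLOB(10M))");

            PreparedStatement ps = conn.prepareStatement("INSERT INTO CLOBTAB VALUES (?)");
            File file = new File("clob500k.txt");
            Reader reader = new FileReader(file);
            // The length argument is in characters; using file.length() assumes
            // a plain single-byte text file.
            ps.setCharacterStream(1, reader, (int) file.length());
            ps.executeUpdate();

            reader.close();
            ps.close();
            s.close();
            conn.close();
        }
    }

Internally, the Reader passed to setCharacterStream is what ends up wrapped in
the ReaderToUTF8Stream discussed below.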

Given that this is how inserts of large clobs currently happen, the
performance analysis is as follows:
--  A profiler run shows that most of the time is spent in SQLChar.readExternal,
which is where the user's input stream is materialized into a char array.
This array grows gradually until the entire stream has been read into it.
The code snippet below shows by how much the array is grown each time more
bytes have to be read from the stream.
 
The DataValueDescriptor (dvd) hierarchy for clob is: SQLClob extends
SQLVarchar extends SQLChar.

So in SQLChar.readExternal:

    ...
    int growby = in.available();
    if (growby < 64)
        growby = 64;

and then a new array is allocated and the existing contents are copied into
it with arraycopy.
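
In other words, each time more data is needed the accumulated contents are
copied into a slightly larger array, roughly like this (a sketch of the
pattern, not the actual Derby code; 'data' stands for the accumulated char
array):

    // Sketch of the grow-and-copy step described above.
    char[] grown = new char[data.length + growby];
    System.arraycopy(data, 0, grown, 0, data.length);
    data = grown;
    // ...then the next chunk is read into the new tail of 'data'...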

--  In the code snippet, 'in' is the wrapper around the user's stream, which
is a ReaderToUTF8Stream.  ReaderToUTF8Stream extends InputStream and does not
override the available() method.  Per the spec, the default
InputStream.available() returns 0.
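
For reference, since ReaderToUTF8Stream does not override it, the method it
inherits is essentially the java.io.InputStream default (paraphrased):

    public int available() throws IOException {
        // Base-class behaviour: subclasses are expected to override this;
        // the default simply reports that nothing is known to be available.
        return 0;
    }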

-- Thus the array grows by only 64 bytes each time, which is obviously not
performant.  For a 500k clob insert this means the allocate-and-arraycopy
step happens ~8000 times.

-- The ReaderToUTF8Stream that wraps the user's stream reads from it, does
the UTF-8 conversion, and puts the result in a 4k array.  I think it is
reasonable to use a 32k buffer to hold this information for clobs.
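
For background, the per-character conversion is the usual modified UTF-8
encoding (the same scheme DataOutputStream.writeUTF uses); a sketch with an
illustrative helper name, not Derby's actual code:

    // Encode one char into 'buf' at offset 'off' using modified UTF-8;
    // returns the new offset.
    private static int writeUtf8Char(byte[] buf, int off, char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            buf[off++] = (byte) c;                            // 1 byte
        } else if (c > 0x07FF) {
            buf[off++] = (byte) (0xE0 | ((c >> 12) & 0x0F));  // 3 bytes
            buf[off++] = (byte) (0x80 | ((c >> 6) & 0x3F));
            buf[off++] = (byte) (0x80 | (c & 0x3F));
        } else {
            buf[off++] = (byte) (0xC0 | ((c >> 6) & 0x1F));   // 2 bytes (also NUL)
            buf[off++] = (byte) (0x80 | (c & 0x3F));
        }
        return off;
    }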

Although I think there are more possible optimizations in this area, I prefer
the incremental approach too :), so this patch is a first step towards fixing
clob insert performance in the current system.

Fix includes:
-- Enhanced the way the array is grown: keep the original 64 bytes for char
(which seems reasonable given the upper limit on char), but override it to 4k
for varchar and clobs.
-- Override available() in ReaderToUTF8Stream to return a better estimate of
how many bytes can be read (a rough sketch of both changes follows below).
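
For illustration only, the shape of the two changes is roughly as follows;
the method name growBy() and the way the remaining length is obtained from
the LimitReader are my own sketch of the idea, not necessarily the exact code
in the patch:

    // SQLChar (sketch): minimum growth stays at 64, reasonable for CHAR.
    protected int growBy() {
        return 64;
    }

    // SQLVarchar (sketch), inherited by SQLClob: grow in 4k steps.
    protected int growBy() {
        return 4096;
    }

    // ReaderToUTF8Stream (sketch): give readExternal a real size hint
    // instead of the inherited "return 0".  Assumes the wrapped LimitReader
    // can report how many characters it will still supply;
    // remainingCharacters() is an illustrative name.
    public int available() throws IOException {
        int remaining = limitReader.remainingCharacters();
        // Every character encodes to at least one byte of UTF-8, so this
        // is a conservative estimate of the bytes left to read.
        return remaining;
    }

With available() returning something close to the real remaining length,
readExternal can size its char array in a handful of allocations instead of
~8000.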

svn stat
M      java\engine\org\apache\derby\impl\jdbc\ReaderToUTF8Stream.java
M      java\engine\org\apache\derby\iapi\services\io\LimitReader.java
M      java\engine\org\apache\derby\iapi\types\SQLChar.java
M      java\engine\org\apache\derby\iapi\types\SQLVarchar.java

-- ran derbyall ok with sun jvm.

--  I can add a test and compare times but I realize that is probably 
not the best solution here.  It should ideally be part of a performance 
regression suite.

Numbers for clob inserts, as per the JIRA issue: with the fix, times are in
seconds for one insert on a clob on my laptop (Windows, 1G RAM, 1.6GHz Intel
Pentium M):

FileSize    IBM JVM 1.4.2    Sun JVM 1.4.2    Sun JVM 1.5
500k        0.9s             1.6s             1.7s
1M          2.1s             4s               5s
2M          3s               9s               11s
4M          7s               18s              22s


Without the fix, 500k takes 3 minutes with the Sun JVM and 20 seconds with
the IBM JVM.
I will attach the test program along with the input files to the JIRA issue.
_________________________
Without this fix: As I already mentioned in the JIRA comment for DERBY-302,
changing the program attached to the JIRA entry to use a BufferedReader with
a buffer size bigger than the default (64k) brought the times for the Sun JVM
down closer to the IBM JVM.  I also noticed that if I ran the test multiple
times and did multiple inserts, the performance of the Sun JVM and the IBM
JVM for a 500k clob was around 20 seconds - I guess the JIT kicks in, plus
the OS cache may also be a factor (a sketch of that test-program change is
below).
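
For reference, the test-program change amounts to replacing the plain
FileReader in a sketch like the one above with a BufferedReader using a
larger buffer (buffer size and file handling illustrative):

    Reader reader = new BufferedReader(new FileReader(file), 64 * 1024);
    ps.setCharacterStream(1, reader, (int) file.length());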
________________________
Please review it, and if there are no comments, could a committer please
commit it?

Thanks,
Sunitha.
