db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Matrigali (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DERBY-6858) Apache Derby simple update statement performance becomes 1500% worse when adding one byte to a column
Date Sun, 21 Feb 2016 01:21:18 GMT

    [ https://issues.apache.org/jira/browse/DERBY-6858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155838#comment-15155838
] 

Mike Matrigali commented on DERBY-6858:
---------------------------------------

I am coming late to this discussion.  mostly what i looked at were the 2 different plans presented
at the beginning of the issue.  Note that the plan info posted in this JIRA is incomplete
and hides important information that I did see tin the link below:
The formatting on this link make easier to read:  http://stackoverflow.com/questions/34874762/apache-derby-simple-update-statement-performance-becomes-1500-worse-when-adding
I am confused about that TESTINDEX is in the posted plans as i do not see that mentioned in
other discussions or in the posted java test code.

Also up front i am going to say i have no idea what affect the foreign key stuff it going
to have, if i was working on this I would eliminate it as much as possible to 
identify if that is a problem.  

What jumped out at me was that they were 2 different plans.  Often in derby if something is
order of magnitude problem it is often an optimizer issue.  
I think the test case is updating every row of the table.  

Note that while the query looks simple, it is actually one of the problem cases for a database.
 update ChildUpdate set parentName = 'Parent 2' where parentName = 'Parent 1' .  This
is updating the same data that it is looking at to perform the update.  Derby does not have
anyway to track what updates it has done in the table for the given transaction so it has
2
options 1) make sure it only visits each row once and does or does not do the update or 2)
do they updates and track the result in a side table and then apply them one by one after
making the list.  option 2 is referred as deferred update, and option 1 is just deferred update=false.
  

The first "fast" plan looks like it is not using any index and just doing the full scan of
the table and updating every row.  And if you know 
you are going to change every row that is probably the best you can do.  And it turns out
that if you pick that plan to find the rows you want to update then
you can also do updates in place (ie. just change the data in the table as you visit each
row).   The optmizer output says that it is going to do this in the line:  deferred: false

The second "slow" plan has chosen to use the index to find the rows it wants to look for.
 An unfortunate choice as we know it is going to have to visit all 2500 rows.   I think the
index
being used has name as the key.  But if there were 10 million rows and only 1 matching update
to do it would be much faster.  Using this index to do updates you must use "deferred" mode
because  if you update the key you might visit the row again later in the index (at least
in general case of updating same column you are filtering on).  

So changing all rows using deferred mode vs direct mode is going to always be slower.  Changing
some subset of the rows may be faster than full table scan, that is where optimizer 
comes in.

So some questions for the "real" application:
1) In real application do you always change all the rows with this update?  Does the test
case actually represent the likely real app scenario?
2) what is lifetime of data and size?  ie. do you load a bunch and then number and data distribution
stays stable?  

Now why the 1byte change affects all of this.   The optmizer uses costs to pick plans.  A
component of these costs is the size of the data.  My guess is that adding one byte made the
table have more pages which then changed the cost estimate to make the index lookup look better.
 The optmizer costing is a known problem and optimizer expertise on the project has been
a problem for some time now.  In this case I would not be surprised if the extra cost of doing
a deferred vs direct update is not calculated at all - it may be a different part of the code
looks
at the chosen plan and then decides if it can be deferred or not - i don't know.

Some things to play with to understand more what is going on:
1) get sizes of the underlying tables and indexes in the good and bad cases.  https://db.apache.org/derby/docs/10.12/ref/rrefsyscsdiagspacetable.html
2) see if updating optimizer statistics helps.  Derby will eventually automatically do this
in background.  But for your test case it seems likely you created
    indexes first and loaded data which will not generate stats.  : https://db.apache.org/derby/docs/10.8/tuning/ctunperfstatistics.html
3) i need some help from others on this one, but you may be able to add hints to your update
query  if you want to work around this problem and force the scan.
4) you might try writing your query with ? parameters vs hard coded variables.  This takes
a different path through the optimizer.  I am not hopeful on this one as
    i think it will lean more toward index scan vs table scan.

> Apache Derby simple update statement performance becomes 1500% worse when adding one
byte to a column
> -----------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-6858
>                 URL: https://issues.apache.org/jira/browse/DERBY-6858
>             Project: Derby
>          Issue Type: Bug
>    Affects Versions: 10.11.1.1, 10.12.1.1
>         Environment: windows 7 64 bit
>            Reporter: somebody
>            Priority: Blocker
>         Attachments: repro.java, repro.java, repro.java
>
>
> I have 2 tables as follows:
> ParentUpdate
> name varchar(255)
> value int not null
> primary key: name
> ChildUpdate
> parentName varchar(255)
> name varchar(255)
> value int
> data varchar(1000)
> primary key: name foreign key: parentName to ParentUpdate.name
> When I run the statement "update ChildUpdate set parentName = 'Parent 2' where parentName
= 'Parent 1'" with 2500 records in the ChildUpdate table and 1 record in the ParentUpdate
table with only a single byte difference in data size in the ChildUpdate table, the performance
decreases by 15 times.
> When the ChildUpdate data column has exactly 14 bytes of the same character the runtime
of the above query is about 500 milliseconds. When I add one more byte to the data column
of ChildUpdate the performance all of a sudden becomes about 7500 milliseconds.
> If i then decrease the data size back to 14 from 15 it's fast again. When i put it back
to 15 it's slow again. This is reproducible every time.
> Can you please help me figure out how to get the same fast performance without such seemingly
random behaviour.
> The query plans are below for both cases.
>         projection = true
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 16
>             close time (milliseconds) = 16
>             restriction time (milliseconds) = 0
>             projection time (milliseconds) = 0
>             optimizer estimated row count: 51.50
>             optimizer estimated cost: 796.12
>         Source result set:
>             Table Scan ResultSet for CHILDUPDATE at read committed isolation level using
exclusive row locking chosen by the optimizer
>             Number of opens = 1
>             Rows seen = 2500
>             Rows filtered = 0
>             Fetch Size = 1
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 15
>                 next time (milliseconds) = 16
>                 close time (milliseconds) = 16
>                 next time in milliseconds/row = 0
>             scan information:
>                 Bit set of columns fetched={0, 1}
>                 Number of columns fetched=2
>                 Number of pages visited=41
>                 Number of rows qualified=2500
>                 Number of rows visited=2500
>                 Scan type=heap
>                 start position:
>                     null
>                 stop position:
>                     null
>                 qualifiers:
>                     Column[0][0] Id: 0
>                     Operator: =
>                     Ordered nulls: false
>                     Unknown return value: false
>                     Negate comparison result: false
>                 optimizer estimated row count: 51.50
>                 optimizer estimated cost: 796.12
> total time: ~500 milliseconds
> and the slow version
>    Statement Name: 
>     null
> Statement Text: 
>     update ChildUpdate set parentName = 'Parent 2' where parentName = 'Parent 1'
> Parse Time: 0
> Bind Time: 0
> Optimize Time: 0
> Generate Time: 0
> Compile Time: 0
> Execute Time: -1453199485700
> Begin Compilation Timestamp : 2016-01-19 05:31:25.684
> End Compilation Timestamp : 2016-01-19 05:31:25.684
> Begin Execution Timestamp : 2016-01-19 05:31:25.7
> End Execution Timestamp : 2016-01-19 05:31:33.141
> Statement Execution Plan Text: 
> Update ResultSet using row locking:
> deferred: true
> Rows updated = 2500
> Indexes updated = 2
> Execute Time = -1453199485747
>     Normalize ResultSet:
>     Number of opens = 1
>     Rows seen = 2500
>         constructor time (milliseconds) = 0
>         open time (milliseconds) = 0
>         next time (milliseconds) = 47
>         close time (milliseconds) = 0
>         optimizer estimated row count: 51.50
>         optimizer estimated cost: 810.94
>     Source result set:
>         Project-Restrict ResultSet (3):
>         Number of opens = 1
>         Rows seen = 2500
>         Rows filtered = 0
>         restriction = false
>         projection = true
>             constructor time (milliseconds) = 0
>             open time (milliseconds) = 0
>             next time (milliseconds) = 32
>             close time (milliseconds) = 0
>             restriction time (milliseconds) = 0
>             projection time (milliseconds) = 0
>             optimizer estimated row count: 51.50
>             optimizer estimated cost: 810.94
>         Source result set:
>             Project-Restrict ResultSet (2):
>             Number of opens = 1
>             Rows seen = 2500
>             Rows filtered = 0
>             restriction = false
>             projection = true
>                 constructor time (milliseconds) = 0
>                 open time (milliseconds) = 0
>                 next time (milliseconds) = 32
>                 close time (milliseconds) = 0
>                 restriction time (milliseconds) = 0
>                 projection time (milliseconds) = 0
>                 optimizer estimated row count: 51.50
>                 optimizer estimated cost: 810.94
>             Source result set:
>                 Index Scan ResultSet for CHILDUPDATE using index TESTINDEX at read committed
isolation level using exclusive row locking chosen by the optimizer
>                 Number of opens = 1
>                 Rows seen = 2500
>                 Rows filtered = 0
>                 Fetch Size = 1
>                     constructor time (milliseconds) = 0
>                     open time (milliseconds) = 0
>                     next time (milliseconds) = 32
>                     close time (milliseconds) = 0
>                     next time in milliseconds/row = 0
>                 scan information:
>                     Bit set of columns fetched={0, 1, 2}
>                     Number of columns fetched=3
>                     Number of deleted rows visited=0
>                     Number of pages visited=42
>                     Number of rows qualified=2500
>                     Number of rows visited=2500
>                     Scan type=btree
>                     Tree height=2
>                     start position:
>                         None
>                     stop position:
>                         None
>                     qualifiers:
>                         Column[0][0] Id: 1
>                         Operator: =
>                         Ordered nulls: false
>                         Unknown return value: false
>                         Negate comparison result: false
>                     optimizer estimated row count: 51.50
>                     optimizer estimated cost: 810.94
> total time: ~7 seconds 500 milliseconds
> please also see post:
> http://stackoverflow.com/questions/34874762/apache-derby-simple-update-statement-performance-becomes-1500-worse-when-adding



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message