db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "A B (JIRA)" <derby-...@db.apache.org>
Subject [jira] Commented: (DERBY-407) predicatesIntoViews test failure on slow machine in Derby 10.1 branch version 201931
Date Mon, 25 Jul 2005 18:11:37 GMT
    [ http://issues.apache.org/jira/browse/DERBY-407?page=comments#action_12316672 ] 

A B commented on DERBY-407:
---------------------------


If you trim out the extra lines of the diff, the failure comes down to the following difference
in query plan (this is just one tiny part of a very large query plan):

Master file --

  Left result set  = Table scan on CLASSIFICATION_VALUES (chosen by the optimizer)
  Right result set = Hash scan on REPOSITORYOBJECTRESOURCE

Actual (when the test fails) --

  Left result set = Index scan on REPOSITORYOBJECTRESOURCE (chosen by the optimizer)
  Right result set = Hash scan on CLASSIFICATION_VALUES

I don't know enough about how the optimizer works to know if one of these plans is inherently
"better" than the other (I think it depends on how many rows are in the underlying tables).
 However, I have confirmed (through discussion with Mike Matrigali) that the plan chosen by
the optimizer _can_ be affected by the speed of the machine.  Internally, the optimizer does
a check to make sure that it doesn't spend too much time trying to find the best query plan--and
it will do a "timeout" if it thinks it's taking too long.  In particular, if the optimizer
has already spent more time choosing a query plan than it thinks the best query plan so far
is going to take to execute, the optimizer will stop looking at plans and just use the best
plan so far.

What this means is that, on a faster machine, the optimizer can potentially see more query
plans before timing out than it would on a slower machine--and thus it might find a better
plan on the faster machine.

So far as I can tell, that's what's happening with this predicatesIntoViews failure.  While
trying to prove this point, I noticed that there is an undocumented user property called "derby.optimizer.noTimeout"
that keeps the optimizer from timing out--i.e. the optimizer will process ALL query plans
and then pick the best one, regardless of how long it takes to do so.  When I set this property
to true and ran the test, it passed 100% of the time, regardless of the speed of my CPU.

That said, I think an initial work-around for this problem is to set the derby.optimizer.noTimeout
property for this test to true (using the harness).  That should allow the test to pass on
a consistent basis until someone can look further at the massive query plan to see what's
happening and to determine if this particular failure is an optimizer bug (see notes below)
or is just 'working as designed'.


-----------
Other notes
-----------

While investigating this failure, I noticed a couple of behavioral oddities that could potentially
help to determine if this is an optimizer bug.  I'm including them here for tracking purposes.
 Note that these behaviors are only relevant if the noTimeout property is NOT set (if it _is_
set, then the test will always pass, regardless of what I describe below).

1) In an attempt to debug the failure, I added a bunch of "select count(*)" queries to the
test to see how large the tables used by the query plan were, and also to see how large the
result set of the query was.  As it turned out, adding such queries caused the test to consistently
choose the second query plan listed above--i.e. the "fail" plan--regardless of the speed of
my CPU.  The specific queries I added are these (I added them just before execution of the
query that's seeing the diff):

  SELECT COUNT(*) FROM XR.CLASSIFICATION_VALUES;
  SELECT COUNT(*) FROM XR.REPOSITORYOBJECTRESOURCE;
  SELECT COUNT(*) FROM XR.REPOSITORYOBJECTALLVERSIONVIEW;
  SELECT count(*)
  FROM xr.repositoryobjectallversionview rov 
  where (uname = UPPER('two') or uname = UPPER('my project'))
  and (versionid in
      (select versionid
       from xr.versionlabel
       where UPPER(label) = UPPER('Snapshot')))
    and deletedate is null;

It would appear that executing these SELECT COUNT(*) queries causes some internal state/row
counts to be updated, which in turn results in a different query plan than what the test usually
expects (even on faster machines).  But if that's the case, then it seems (to me) like the
updated state/row counts should cause the optimizer to choose a _better_ plan than what it
would otherwise choose--so it's not clear to me what's going on there.  Perhaps the updated
state/row counts are incorrect or are being handled incorrectly, causing the optimizer to
choose the wrong query plan...?

2) If I leave noTimeout set to false (its default) and do NOT add the "select count(*)" queries
mentioned above, but I add system shutdown/reconnect logic to the test just before executing
the query in question, then the test passes regardless of how fast/slow my CPU is.  Mike Matrigali
suggested that this is because system shutdown causes all state/row counts to be flushed to
disk, so reconnecting and then executing the query might allow the optimizer to use the latest
state/row counts to find the best plan.  Of course, if that's true then it seems like we'd
have at least a slight contradiction to the behavior described in #1 above, where updated
stats/row counts are causing the test to _fail_...

Note that if the "select count(*)" queries are included in the test, then the test will fail
regardless of whether or not we shutdown/reconnect.

The interaction between these behaviors and the noTimeout flag can be summarized as follows.

Let "TO" correspond to a noTimeout property that is TRUE.
  Then "!TO" is a noTimeout property that is FALSE (the
  default).

Let "SC" correspond to the presence of "select count(*)"
 queries in the test.  Then "!SC" means that such queries
 are NOT included.

Let "SH" correspond to the presence of shutdown/reconnect
 logic in the test.  Then "!SH" means that such logic is
 is NOT included.

Then we have the following behavior (at least, on my own laptop):

!TO && !SC && !SH ==> Intermittent failure on slower machines.
!TO && !SC &&  SH ==> Test passes regardless of CPU speed.
!TO &&  SC && !SH ==> Test fails regardless of CPU speed.
!TO &&  SC &&  SH ==> Test fails regardless of CPU speed.
 TO && !SC && !SH ==> Test passes regardless of CPU speed.
 TO && !SC &&  SH ==> Test passes regardless of CPU speed.
 TO &&  SC && !SH ==> Test passes regardless of CPU speed.
 TO &&  SC &&  SH ==> Test passes regardless of CPU speed.

Again, I'm not sure how much of this information is relevant info, but since these are all
things I noticed while investigating the problem, I'm including them for the sake of completeness/tracking,
in case they'll save someone some time in the future...

In the end, since setting noTimeout to TRUE seems like the simplest and most reliable way
to get the test to pass on a consistent basis, that is what I'm proposing as a tempory workaround
until further investigation can be carried out.  I will post a patch to do that shortly.


> predicatesIntoViews test failure on slow machine in Derby 10.1 branch version 201931
> ------------------------------------------------------------------------------------
>
>          Key: DERBY-407
>          URL: http://issues.apache.org/jira/browse/DERBY-407
>      Project: Derby
>         Type: Test
>   Components: Test
>     Versions: 10.1.1.0
>  Environment: Java Version:    1.4.2_04
> Java Vendor:     Hewlett-Packard Company
> Java home:       /usr/tandem/java/jre
> Java classpath:  /home/john/j2ee/derbyTests/derby-10.1-branch/jars/derby.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbytools.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbynet.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyclient.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc_license_c.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyTesting.jar:/home/john/j2ee/derbyTests/common-jars/jakarta-oro-2.0.8.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_de_DE.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_es.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_fr.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_it.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ja_JP.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ko_KR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_pt_BR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_CN.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_TW.jar:/usr/tandem/java/lib/classes.zip:/usr/tandem/javaextv20/lib/tdmext.jar:/usr/tandem/jdbcMp/current/lib/sqlmp.jar:/usr/tandem/jdbcMx/current/lib/jdbcMx.jar:.
> OS name:         NONSTOP_KERNEL
> OS architecture: mips
> OS version:      G06
> Java user name:  SUPER.JOHN
> Java user home:  /home/john
> Java user dir:   /home/john/j2ee/derbyTests/derby-10.1-branch/testOutput
> java.specification.name: Java Platform API Specification
> java.specification.version: 1.4
>     Reporter: John Sisson
>  Attachments: derbylang_report.txt
>
> The predicatesIntoViews test (part of the derbylang suite) fails when running the tests
on a busy/slow system, but they pass when I run them on my fast Windows box.
> It appears that the failure could be due to different optimizations being performed due
to the difference in speed of the machines.  This needs to be confirmed that it is the case
of the test failure.  I will attach the test summary to this issue.
> I also noticed that the test report said 0% fail even though one test failed. 
> All derbylang tests passed on Windows XP. 
> Thanks, 
> John 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Mime
View raw message