db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "A B (JIRA)" <derby-...@db.apache.org>
Subject [jira] Commented: (DERBY-407) predicatesIntoViews test failure on slow machine in Derby 10.1 branch version 201931
Date Mon, 25 Jul 2005 18:22:39 GMT
    [ http://issues.apache.org/jira/browse/DERBY-407?page=comments#action_12316675 ] 

A B commented on DERBY-407:
---------------------------

If you trim out the extra lines of the diff, the failure comes
down to the following difference in query plan (this is just one
tiny part of a very large query plan):

Master file --

	Left result set  = Table scan on CLASSIFICATION_VALUES (chosen by the optimizer)
	Right result set = Hash scan on REPOSITORYOBJECTRESOURCE

Actual (when the test fails) --

	Left result set = Index scan on REPOSITORYOBJECTRESOURCE (chosen by the optimizer)
	Right result set = Hash scan on CLASSIFICATION_VALUES

I don't know enough about how the optimizer works to know
if one of these plans is inherently "better" than the other
(I think it depends on how many rows are in the underlying
tables).  However, I have confirmed (through discussion with
Mike Matrigali) that the plan chosen by the optimizer _can_
be affected by the speed of the machine.  Internally,
the optimizer does a check to make sure that
it doesn't spend too much time trying to find the best
query plan--and it will do a "timeout" if it thinks it's
taking too long.  In particular, if the optimizer has
already spent more time choosing a query plan than it
thinks the best query plan so far is going to take to
execute, the optimizer will stop looking at plans and
just use the best plan so far.

What this means is that, on a faster machine, the
optimizer can potentially see more query plans before
timing out than it would on a slower machine--and thus
it might find a better plan on the faster machine.

So far as I can tell, that's what's happening with this
predicatesIntoViews failure.  While trying to prove this
point, I noticed that there is an undocumented user
property called "derby.optimizer.noTimeout" that keeps
the optimizer from timing out--i.e. the optimizer will process
ALL query plans and then pick the best one, regardless
of how long it takes to do so.  When I set this property
to true and ran the test, it passed 100% of the time,
regardless of the speed of my CPU.

That said, I think an initial work-around for this problem
is to set the derby.optimizer.noTimeout property for this
test to true (using the harness).  That should allow the
test to pass on a consistent basis until someone can
look further at the massive query plan to see what's
happening and to determine if this particular failure
is an optimizer bug (see notes below) or is just
'working as designed'.


-----------
Other notes
-----------

While investigating this failure, I noticed a couple of
behavioral oddities that could potentially help to
determine if this is an optimizer bug.  I'm including
them here for tracking purposes.  Note that these
behaviors are only relevant if the noTimeout property
is NOT set (if it _is_ set, then the test will always
pass, regardless of what I describe below).

1) In an attempt to debug the failure, I added a bunch
of "select count(*)" queries to the test to see how
large the tables used by the query plan were, and also
to see how large the result set of the query was.  As it
turned out, adding such queries caused the test to
consistently choose the second query plan listed above--
i.e. the "fail" plan--regardless of the speed of my
CPU.  The specific queries I added are these
(I added them just before execution of the query
that's seeing the diff):

  SELECT COUNT(*) FROM XR.CLASSIFICATION_VALUES;
  SELECT COUNT(*) FROM XR.REPOSITORYOBJECTRESOURCE;
  SELECT COUNT(*) FROM XR.REPOSITORYOBJECTALLVERSIONVIEW;
  SELECT count(*)
  FROM xr.repositoryobjectallversionview rov 
  where (uname = UPPER('two') or uname = UPPER('my project'))
  and (versionid in
      (select versionid
       from xr.versionlabel
       where UPPER(label) = UPPER('Snapshot')))
    and deletedate is null;

It would appear that executing these SELECT COUNT(*)
queries causes some internal state/statistics to be
updated, which in turn results in a different query
plan than what the test usually expects (even on faster
machines).  But if that's the case, then it seems (to me)
like the updated state/statistics should cause the
optimizer to choose a _better_ plan than what it would
otherwise choose--so it's not clear to me what's going
on there.  Perhaps the updated state/statistics are
incorrect or are being handled incorrectly, causing
the optimizer to choose the wrong query plan...?

2) If I leave noTimeout set to false (its default) and
do NOT add the "select count(*)" queries mentioned above,
but I add system shutdown/reconnect logic to the test just
before executing the query in question, then the test
passes regardless of how fast/slow my CPU is.  Mike
Matrigali suggested that this is because system shutdown
causes all state/row counts to be flushed to disk, so
reconnecting and then executing the query might allow the
optimizer to use the latest state/row counts to find
the best plan.  Of course, if that's true then it seems
like we'd have at least a slight contradiction to the
behavior described in #1 above, where updated stats/row
counts are causing the test to _fail_.

Note that if the "select count(*)" queries are included
in the test, then the test will fail regardless of
whether or not we shutdown/reconnect.

The interaction between these behaviors and the noTimeout
flag can be summarized as follows.

Let "TO" correspond to a noTimeout property that is TRUE.
  Then "!TO" is a noTimeout property that is FALSE (the
  default).

Let "SC" correspond to the presence of "select count(*)"
 queries in the test.  Then "!SC" means that such queries
 are NOT included.

Let "SH" correspond to the presence of shutdown/reconnect
 logic in the test.  Then "!SH" means that such logic is
 is NOT included.

Then we have the following behavior (at least, on my own
laptop):

!TO && !SC && !SH ==> Intermittent failure on slower machines.
!TO && !SC &&  SH ==> Test passes regardless of CPU speed.
!TO &&  SC && !SH ==> Test fails regardless of CPU speed.
!TO &&  SC &&  SH ==> Test fails regardless of CPU speed.
 TO && !SC && !SH ==> Test passes regardless of CPU speed.
 TO && !SC &&  SH ==> Test passes regardless of CPU speed.
 TO &&  SC && !SH ==> Test passes regardless of CPU speed.
 TO &&  SC &&  SH ==> Test passes regardless of CPU speed.

Again, I'm not sure how much of this information is
relevant info, but since these are all things I noticed
while investigating the problem, I'm including them
for the sake of completeness/tracking, in case they'll
save someone some time in the future...

In the end, since setting noTimeout to TRUE seems like
the simplest and most reliable way to get the test to
pass on a consistent basis, that is what I'm proposing 
as a tempory workaround until further investigation can 
be carried out.  I will post a patch to do that shortly.


> predicatesIntoViews test failure on slow machine in Derby 10.1 branch version 201931
> ------------------------------------------------------------------------------------
>
>          Key: DERBY-407
>          URL: http://issues.apache.org/jira/browse/DERBY-407
>      Project: Derby
>         Type: Test
>   Components: Test
>     Versions: 10.1.1.0
>  Environment: Java Version:    1.4.2_04
> Java Vendor:     Hewlett-Packard Company
> Java home:       /usr/tandem/java/jre
> Java classpath:  /home/john/j2ee/derbyTests/derby-10.1-branch/jars/derby.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbytools.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbynet.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyclient.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc_license_c.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyTesting.jar:/home/john/j2ee/derbyTests/common-jars/jakarta-oro-2.0.8.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_de_DE.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_es.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_fr.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_it.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ja_JP.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ko_KR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_pt_BR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_CN.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_TW.jar:/usr/tandem/java/lib/classes.zip:/usr/tandem/javaextv20/lib/tdmext.jar:/usr/tandem/jdbcMp/current/lib/sqlmp.jar:/usr/tandem/jdbcMx/current/lib/jdbcMx.jar:.
> OS name:         NONSTOP_KERNEL
> OS architecture: mips
> OS version:      G06
> Java user name:  SUPER.JOHN
> Java user home:  /home/john
> Java user dir:   /home/john/j2ee/derbyTests/derby-10.1-branch/testOutput
> java.specification.name: Java Platform API Specification
> java.specification.version: 1.4
>     Reporter: John Sisson
>  Attachments: derbylang_report.txt
>
> The predicatesIntoViews test (part of the derbylang suite) fails when running the tests
on a busy/slow system, but they pass when I run them on my fast Windows box.
> It appears that the failure could be due to different optimizations being performed due
to the difference in speed of the machines.  This needs to be confirmed that it is the case
of the test failure.  I will attach the test summary to this issue.
> I also noticed that the test report said 0% fail even though one test failed. 
> All derbylang tests passed on Windows XP. 
> Thanks, 
> John 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Mime
View raw message