db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kevin Hore (JIRA)" <derby-...@db.apache.org>
Subject [jira] Commented: (DERBY-47) Some possible improvements to IN optimization
Date Tue, 15 Nov 2005 15:24:28 GMT
    [ http://issues.apache.org/jira/browse/DERBY-47?page=comments#action_12357700 ] 

Kevin Hore commented on DERBY-47:
---------------------------------

I have evidence (presented below) that the query optimizer is making some very poor choices
when deciding how to make use of the available indexes. Unfortunately, this is making Derby
unusable for the application I'm working on.

Clearly, the Derby engine is fundamentally capable of executing queries in a sensible amount
of time, but the query optimizer appears to get confused if there are several indexes to choose
from and multiple references to the same column in the WHERE clause.

In my previous, comment I gave details of simple example that demonstrated the problem. To
recap, I am executing variants of the following SQL against a relatively simple table:

SELECT ObjectId, SUM(WordLocation) AS Score
FROM tblSearchDictionary
WHERE Word = 'CONTACT' OR  Word = 'ADD'
GROUP BY ObjectId;

It makes no difference to the query performance or query plan if the WHERE clause is as above,
or is re-written as:

WHERE Word IN ('CONTACT', 'ADD')

-- ORIGINAL RESULTS: --

The timings with the schema described in my previous comment are:

Matching one term: ~200ms avg.
Matching two terms: ~10000ms avg.

The "matching one term" timings are with the following WHERE clause:

WHERE Word = 'CONTACT'

(searching just for 'ADD' gives similar timings).

The "matching two terms" timings are with the following WHERE clause:

WHERE Word = 'CONTACT' OR Word = 'ADD'

The query plans for these timings can be found in my previous comment.

With some suggestions from the derby-users list, I have attempted to redefine the indexes
on the table to see if that will have any effect on the choices made by the query planner.

Dropping all the column indexes and then reordering the unique constraint so that the most
varying column is first, and the least varying last, caused the query planner to change its
plan to use the index backing the unique constraint to match the terms.

DROP INDEX TBLSEARCHDICTIONARYOBJECTID;
DROP INDEX TBLSEARCHDICTIONARYOBJECTTYPE;
DROP INDEX TBLSEARCHDICTIONARYWORD;
DROP INDEX TBLSEARCHDICTIONARYWORDLOCATION;
ALTER TABLE TBLSEARCHDICTIONARY DROP UNIQUE CONSd0e222
ALTER TABLE TBLSEARCHDICTIONARY ADD CONSTRAINT CONSd0e222 UNIQUE
(ObjectId,Word,WordLocation,ObjectType);

However from the results you can see that the search for a single term suffered a big loss
in performance.

-- RESULTS WITH OPTIMIZED CONSTRAINT:  --

Matching one term: ~4000ms avg.
Matching two terms: ~600ms avg.

I have attached the relevant query plans for each of these results.

This is a very surprising result: matching one term performs far worse than matching two.

In an attempt to remedy problem of the poor single term performance, I re-introduced the index
to the Word column with the following schema change:

CREATE INDEX tblSearchDictionaryWord ON tblSearchDictionary (Word );

-- RESULTS WITH OPTIMIZED CONSTRAINT AND INDEX ON WORD FIELD:  --

Matching one term: ~200ms avg.
Matching two terms: ~4500 ms avg.

Again, I have attached the relevant query plans.

Although the additional index is used in the single term query, it can be seen that adding
the additional index has the effect of causing the planner to get confused about which indexes
to use for the two term case. The plan here shows an index scan of both indexes, resulting
in an average time seven times worse than before.

I should add here that replacing the WHERE OR clause for the two term query with an IN() produces
exactly the same plan and results, which is what one would expect based on the Derby documentation's
description of how the optimizer re-writes WHERE OR clauses.

It seems to me that the optimizer is broken when processing WHERE clauses with OR and WHERE
clauses with IN. The engine is obviously capable of executing my example in a reasonable time,
but the planner's choices let it down. This causes significant problems in large applications
with many complex queries, particularly where the number of terms in an IN clause varies,
as it then becomes almost impossible to construct queries that execute with reliably acceptable
performance.

It may be interesting to note that the application I'm working on also can also use Mckoi
or SQL Server 2000. Both the single and double term searches execute with acceptable performance
with both of those database engines.

-- 
Kevin Hore
www.araxis.com


> Some possible improvements to IN optimization
> ---------------------------------------------
>
>          Key: DERBY-47
>          URL: http://issues.apache.org/jira/browse/DERBY-47
>      Project: Derby
>         Type: Improvement
>   Components: SQL
>     Versions: 10.0.2.0
>  Environment: all
>     Reporter: Sunitha Kambhampati

>
> Consider a simple case of  - 
> A table tbl has 10000 rows, there is a primary key index on i1
> and the query in question is 
>  select * from tbl where i1 in (-1,100000)
> derby does a table scan of the entire table even though the "IN" list has only two values
and the comparison is on a field that has an index.
> Briefly looking at the code, it seems like we insert a between and use the IN list to
get the start and stop values for the scan. Thus the range of the values in the "IN" list
here plays an important role. 
> Thus if the query was changed to select * from tbl where i1 in (-1, 1), an index scan
would be chosen.
> It would be nice if we could do something clever in this case where there is clearly
an index on the field and the number of values in the IN list is known. Maybe use the rowcount
estimate and the IN list size to do some optimizations.  
> - consider the length of the "IN" list to do searches on the table.  ie use the IN list
values to do index key searches on the table,
> -or try to convert it to a join. Use the "IN" list values to create a temporary table
and do a join. It is most likely that the optimizer will choose the table with "IN" list here
as the outer table in the join and thus will do key searches on the larger table. 
> -------------------------------------------------------------------
> some query plans that I logged using derby.language.logQueryPlan=true for some similar
queries:
> Table has ascending values from 0 - 9999 for i1. primary key index on i1.
> GMT Thread[UT0,5,main] (XID = 19941), (SESSIONID = 0), select * from scanfixed where
i1 in (-1,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict ResultSet
(2):
> Number of opens = 1
> Rows seen = 10000
> Rows filtered = 9990
> restriction = true
> projection = false
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:          750.38
> 	optimizer estimated cost:         8579.46
> Source result set:
> 	Table Scan ResultSet for SCANFIXED at read committed isolation level using instantaneous
share row locking chosen by the optimizer
> 	Number of opens = 1
> 	Rows seen = 10000
> 	Rows filtered = 0
> 	Fetch Size = 16
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		next time in milliseconds/row = 0
> 	scan information: 
> 		Bit set of columns fetched=All
> 		Number of columns fetched=9
> 		Number of pages visited=417
> 		Number of rows qualified=10000
> 		Number of rows visited=10000
> 		Scan type=heap
> 		start position: 
> null		stop position: 
> null		qualifiers:
> Column[0][0] Id: 0
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
> Column[0][1] Id: 0
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
> 		optimizer estimated row count:          750.38
> 		optimizer estimated cost:         8579.46
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> l
> 2004-10-14 18:59:47.577 GMT Thread[UT0,5,main] (XID = 19216), (SESSIONID = 0), select
* from scanfixed where i1 in (9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict
ResultSet (3):
> Number of opens = 1
> Rows seen = 10
> Rows filtered = 0
> restriction = true
> projection = true
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:            4.80
> 	optimizer estimated cost:           39.53
> Source result set:
> 	Index Row to Base Row ResultSet for SCANFIXED:
> 	Number of opens = 1
> 	Rows seen = 10
> 	Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6, 7, 8}
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		optimizer estimated row count:            4.80
> 		optimizer estimated cost:           39.53
> 		Index Scan ResultSet for SCANFIXED using index SCANFIXEDX at read committed isolation
level using instantaneous share row locking chosen by the optimizer
> 		Number of opens = 1
> 		Rows seen = 10
> 		Rows filtered = 0
> 		Fetch Size = 16
> 			constructor time (milliseconds) = 0
> 			open time (milliseconds) = 0
> 			next time (milliseconds) = 0
> 			close time (milliseconds) = 0
> 			next time in milliseconds/row = 0
> 		scan information: 
> 			Bit set of columns fetched=All
> 			Number of columns fetched=2
> 			Number of deleted rows visited=0
> 			Number of pages visited=2
> 			Number of rows qualified=10
> 			Number of rows visited=10
> 			Scan type=btree
> 			Tree height=2
> 			start position: 
> 	>= on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			stop position: 
> 	> on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			qualifiers:
> None
> 			optimizer estimated row count:            4.80
> 			optimizer estimated cost:           39.53

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


Mime
View raw message