db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "A B (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DERBY-47) Some possible improvements to IN optimization
Date Tue, 13 Mar 2007 00:04:09 GMT

     [ https://issues.apache.org/jira/browse/DERBY-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

A B updated DERBY-47:
---------------------

    Attachment: d47_mp_preprocess_v1.stat
                d47_mp_masters_v1.patch
                d47_mp_preprocess_v1.patch

Committed d47_mp_exec_v1.patch with svn 517470:

  URL: http://svn.apache.org/viewvc?view=rev&rev=517470

Attaching d47_mp_preprocess_v1.patch, which is (finally) the patch that actually creates "probe
predicates" during preprocessing, thus allowing the code changes in all previous patches to
take effect.  Once this patch is committed Derby will start re-writing IN lists as probe predicates
and, if the optimizer thinks it is best to do so, will start doing index "multi-probing" at
execution time to avoid excessive scanning.  The changes in this patch affect "preprocessing"
logic as follow:

  1. Replaces "A" with "B", where "A" is existing logic that creates a BETWEEN
     node for IN-lists containing all constants, and "B" is new logic that
     creates a "probe predicate" for IN-lists containing all constants *and/or*
     parameter nodes.  The probe predicates are then used throughout optimization,
     modification of access paths, code generation, and execution time (as
     appropriate) in the manner described by previous patches.

  2. Adds some additional logic to OrNode preprocessing to allow the conversion
     of queries like:

        select ... from T1 where i in (2, 3) or i in (7, 10)

     into queries that look like:

        select ... from T1 where i in (2, 3, 7, 10)

     This is really just an extension of the existing logic to transform a
     chain of OR nodes into an IN-list.

  3. Adds logic to PredicateList.pushExpressionsIntoSelect() to correctly
     copy "probe predicates" so that the left operand (column reference)
     is pointing to the correct place when we do static pushing of one-
     sided predicates (which is what a "probe predicate" is).

  4. Adds a new method to ValueNodeList that is used for checking to see if
     a list of IN values consists solely of constant and/or parameter nodes
     (there are no other expressions or column references).

I'm also attaching a corresponding patch, d47_mp_masters_v1.patch, which contains all of the
diffs caused by the new multi-probing functionality.  As is typical with tests that print
out query plans, a simple change in the execution-time behavior can lead to massive diffs.
 I manually looked at all of the diffs in the masters_v1.patch and it is my belief that all
but one of them are acceptable and expected given the changes for this issue.  The one exception
is for store/readlocks.sql, which is a test about which I know very little.  My guess (or
perhaps more accurately, my _hope_) is that this readlocks diff is okay, but it would be great
if someone who knows more about it could verify.

Note that I've separated preprocess_v1.patch from masters_v1.patch for ease of review, but
they both need to be committed at the same time in order to avoid failures in the nightly
regression tests.

As always, I'd appreciate it if anyone has the time look these changes over and make comments.
 If there are no objections I plan to commit these two patches Wednesday afternoon (March
14th, PST).

Remaining tasks once preprocess_v1 is committed:

  - Address the (excellent) review comments raised by Bryan in one or more
    follow-up patches.

  - Add test cases to verify that Derby is now behaving as desired in the
    face of IN list restrictions on columns with useful indexes.

  - Post some simple numbers to show the improvement that I see when running
    Derby47PerformanceTest.java (attached to this issue) before and after
    the changes for this issue.  Also, discuss a couple of areas to investigate
    post-DERBY-47 (i.e. things to consider as separate Jira issues).

I'm hoping to have most of this work posted by the end of the week, but of course that all
depends on the feedback I receive and the amount of other "stuff" that comes up between now
and then.

> Some possible improvements to IN optimization
> ---------------------------------------------
>
>                 Key: DERBY-47
>                 URL: https://issues.apache.org/jira/browse/DERBY-47
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.0.2.0
>         Environment: all
>            Reporter: Sunitha Kambhampati
>         Assigned To: A B
>         Attachments: d47_engine_doNotCommit_v1.patch, d47_engine_doNotCommit_v1.stat,
d47_mp_addlTestCases.patch, d47_mp_CBO_MoAP_v1.patch, d47_mp_CBO_MoAP_v1.stat, d47_mp_codeGen_v1.patch,
d47_mp_codeGen_v1.stat, d47_mp_exec_v1.patch, d47_mp_exec_v1.stat, d47_mp_masters_v1.patch,
d47_mp_preprocess_v1.patch, d47_mp_preprocess_v1.stat, d47_mp_relOpPredCheck_v1.patch, d47_mp_relOpPredCheck_v1.stat,
derby-47-performance-data.txt, derby-47-performance-data.txt, Derby47PerformanceTest.java,
Derby47PerformanceTest.java, InListOperatorNode.java, QueryPlanUniqueIndexAndWordIndexOneTerm.txt,
QueryPlanUniqueIndexAndWordIndexTwoTerms.txt, QueryPlanUniqueIndexOnlyOneTerm.txt, QueryPlanUniqueIndexOnlyTwoTerms.txt,
readlocks.diff, readlocks_withContext.diff
>
>
> Consider a simple case of  - 
> A table tbl has 10000 rows, there is a primary key index on i1
> and the query in question is 
>  select * from tbl where i1 in (-1,100000)
> derby does a table scan of the entire table even though the "IN" list has only two values
and the comparison is on a field that has an index.
> Briefly looking at the code, it seems like we insert a between and use the IN list to
get the start and stop values for the scan. Thus the range of the values in the "IN" list
here plays an important role. 
> Thus if the query was changed to select * from tbl where i1 in (-1, 1), an index scan
would be chosen.
> It would be nice if we could do something clever in this case where there is clearly
an index on the field and the number of values in the IN list is known. Maybe use the rowcount
estimate and the IN list size to do some optimizations.  
> - consider the length of the "IN" list to do searches on the table.  ie use the IN list
values to do index key searches on the table,
> -or try to convert it to a join. Use the "IN" list values to create a temporary table
and do a join. It is most likely that the optimizer will choose the table with "IN" list here
as the outer table in the join and thus will do key searches on the larger table. 
> -------------------------------------------------------------------
> some query plans that I logged using derby.language.logQueryPlan=true for some similar
queries:
> Table has ascending values from 0 - 9999 for i1. primary key index on i1.
> GMT Thread[UT0,5,main] (XID = 19941), (SESSIONID = 0), select * from scanfixed where
i1 in (-1,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict ResultSet
(2):
> Number of opens = 1
> Rows seen = 10000
> Rows filtered = 9990
> restriction = true
> projection = false
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:          750.38
> 	optimizer estimated cost:         8579.46
> Source result set:
> 	Table Scan ResultSet for SCANFIXED at read committed isolation level using instantaneous
share row locking chosen by the optimizer
> 	Number of opens = 1
> 	Rows seen = 10000
> 	Rows filtered = 0
> 	Fetch Size = 16
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		next time in milliseconds/row = 0
> 	scan information: 
> 		Bit set of columns fetched=All
> 		Number of columns fetched=9
> 		Number of pages visited=417
> 		Number of rows qualified=10000
> 		Number of rows visited=10000
> 		Scan type=heap
> 		start position: 
> null		stop position: 
> null		qualifiers:
> Column[0][0] Id: 0
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
> Column[0][1] Id: 0
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
> 		optimizer estimated row count:          750.38
> 		optimizer estimated cost:         8579.46
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> l
> 2004-10-14 18:59:47.577 GMT Thread[UT0,5,main] (XID = 19216), (SESSIONID = 0), select
* from scanfixed where i1 in (9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict
ResultSet (3):
> Number of opens = 1
> Rows seen = 10
> Rows filtered = 0
> restriction = true
> projection = true
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:            4.80
> 	optimizer estimated cost:           39.53
> Source result set:
> 	Index Row to Base Row ResultSet for SCANFIXED:
> 	Number of opens = 1
> 	Rows seen = 10
> 	Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6, 7, 8}
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		optimizer estimated row count:            4.80
> 		optimizer estimated cost:           39.53
> 		Index Scan ResultSet for SCANFIXED using index SCANFIXEDX at read committed isolation
level using instantaneous share row locking chosen by the optimizer
> 		Number of opens = 1
> 		Rows seen = 10
> 		Rows filtered = 0
> 		Fetch Size = 16
> 			constructor time (milliseconds) = 0
> 			open time (milliseconds) = 0
> 			next time (milliseconds) = 0
> 			close time (milliseconds) = 0
> 			next time in milliseconds/row = 0
> 		scan information: 
> 			Bit set of columns fetched=All
> 			Number of columns fetched=2
> 			Number of deleted rows visited=0
> 			Number of pages visited=2
> 			Number of rows qualified=10
> 			Number of rows visited=10
> 			Scan type=btree
> 			Tree height=2
> 			start position: 
> 	>= on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			stop position: 
> 	> on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			qualifiers:
> None
> 			optimizer estimated row count:            4.80
> 			optimizer estimated cost:           39.53

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message