db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "A B (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-47) Some possible improvements to IN optimization
Date Mon, 12 Mar 2007 17:00:22 GMT

    [ https://issues.apache.org/jira/browse/DERBY-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480146
] 

A B commented on DERBY-47:
--------------------------

Bryan Pendleton (JIRA) wrote:
> I finally got around to reading through the patches.  I haven't actually
> *run* any of the code, just given the patches a close reading

Thank you _very_ much for taking the time to do such a thorough review, Bryan.  I definitely
appreciate the feedback.  My attempted answers are below, but if anything is still unclear,
please do not hesitate to ask again.  Better to get this right the first time :)

> 1) In the relOpPredCheck patch, you made this change to PredicateList.java:
> 
> -			if (relop == null || ! relop.isQualifier(optTable, false))
> +			if (!pred.isRelationalOpPredicate() ||
> +				!pred.getRelop().isQualifier(optTable, false))
> 
>    I'm concerned that this change may have a hole for a possible NPE. Is it
>    possible that there could arise a case where we have an IN-list predicate
>    at this point, such that relop is null but in-list is not null, and then
>    pred.isRelationalOpPredicate() would return false, but pred.getRelop()
>    would return null? 

If we assume that we do in fact have a situation where "relop is null but in-list is not null",
then pred.isRelationalOpPredicate() will, as you said, return false.  That means that "!pred.isRelationalOpPredicate()"
will return true, and since we are using a short-circuited OR operator, I don't think we would
ever get to the pred.getRelop() call in that case, would we?

The intent was that we only get to the second part of the OR if we know for a fact that pred.getRelop()
will return a non-null value.  And if "pred" is a relational op predicate (i.e. if the first
part of the OR evaluates to "false") then that should always be the case.  It is of course
possible that I missed something and that the code doesn't follow the intent; please let me
know if you can think of such a case...

> 2) I spent some time studying the code in PredicateList.java which manipulates
>    IN list operator predicates, and got a bit twisted around.

[ snip code fragment ]

>   My naive reaction to code like this is "shouldn't this be handled in
>   pred.getSourceInList()"?

Good point.  When I was making the changes I wrote "getSourceInList()" specifically for the
new probe predicates; I was just leaving the existing InListOperatorNode logic as it was.
 But you're right, it's probably cleaner to expand that method to cover the "old" IN-list
cases, as well.  I will look into this for a follow-up patch.

>   Can you help me understand why are there these two different ways to get
>   the InListOperatorNode for a predicate (via getSourceInList() and via 
>   getAndNode.getLeftOperand() ), 

The former (getSourceInList()) was added to handle DERBY-47 probe predicates while the latter
(getAndNode.getLeftOperand()) was already there for handling the "normal" (pre DERBY-47) InLists.
 But you're right, it might be good to combine the two--I'll try to do that.

> 3) The new method PredicateList.generateInListValues() seems to go to some
>    pains to handle the case where it is called, but this predicate list
>    actually doesn't have any in list values to generate. My reaction was:
>    "why is this routine getting called in this case?" It seems like
>    NestedLoopJoinStrategy never calls generateInListValues unless there are
>    actually in-list values to generate, so why does generateInListValues
>    need to handle the case where there is no InListOperatorNode?
> 
>    That is, does the final "else" clause in generateInListValues ever actually
>    occur?

Great catch!  When I first wrote the DERBY-47 changes I had all of the probing logic inside
the TableScanResultSet class, and thus the additional arguments were always generated, even
in cases where no probe predicates existed.  I later realized that this was too much overhead
since most TableScanResultSets will probably *not* be doing multi-probing.  So I refactored
the code and created a new result set, MultiProbeTableScanResultSet, which is only generated
when probe predicates exist.  But I forgot to remove the corresponding logic from the generateInListValues()
method, hence the "else".

So you're absolutely right: the "else" clause can be removed here and the code can be cleaned
up accordingly.  I'll address that in a follow-up patch.

> 4) The code which traverses predicate lists seems to always traverse them
>    in backwards order, e.g. this code from HashJoinStrategy.java:
> 
>             for (int i = storeRestrictionList.size() - 1; i >= 0; i--)
> 
>    Why do we always traverse these backwards? Is this just an optimization
>    in order to call the size() method only once? Or is there something
>    deeper going on?

This is just habit more than anything.  A lot of the optimizer-related work that I've been
doing requires the removal of certain predicates from predicate lists at various points in
the code, and in that case reverse traversal is better (because removal of elements from the
rear does not require adjustments to the loop index).  So I often write loop iteration with
backwards traversal out of sheer habit, even when removal of predicates is not happening.
 If you think this makes the code more confusing or less intuitive I have no problems with
switching to forward traversal.

> 5) In InListOperatorNode, I at first thought that you had put this code in,
>    but then after more study I saw that this was just an artifact of the
>    patch program and you hadn't introduced this code, but just re-arranged it
>    a bit. Still, the code showed up with "+" signs in the patch so I looked
>    at it, and feel compelled to comment on it.
> 
>    This is at about line 413 in InListOperatorNode.java:
> 
>         //LocalField receiverField =
>         //  acb.newFieldDeclaration(Modifier.PRIVATE, receiverType);
> 
>         leftOperand.generateExpression(acb, mb);
>         mb.dup();
>         //mb.putField(receiverField); // instance for method call
>         /*mb.getField(receiverField);*/ mb.upCast(leftInterfaceType); // first arg
> 
>    Do you know what is going on here? What is this commented out code, and
>    can we clean it up? The last line in particular is quite frustrating with
>    its snippet of actual code sandwiched in between two comments on the line.

Short answer is that I don't know what the commented out code is for, and my guess is that
Yes, it can (and should) be cleaned up.  But I frequently comment on other people's patches
that they should refrain from making unrelated "cleanup" changes as it makes the patches harder
to review, so I thought maybe I should follow my own advice ;)  Feel free to commit this two-line
cleanup as a separate patch if you'd like--or I can do it myself, as well.  Either way, so
long as its a separate commit, I think that's a good idea.

> 6) I seem to recall some discussions on the mailing lists from people who
>    had code generation systems which tended to generate horrific IN clauses
>    with, for example, upwards of 1500 elements in the IN list. I think it would
>    be nice to have a test or two which verified that we can actually compile
>    and run such a SQL statement. You may have already done this, and I missed
>    it, but most of the test cases in the addTestCases patch seemed to have
>    IN list clauses with no more than 8-10 values in them.

There is an existing test case in lang/inbetween.sql that has such a test for "normal" (pre
DERBY-47) IN-list processing.  The query in that case has 4056 values (constants) in it.

I was also looking at the repro attached to DERBY-47, which generates and executes IN-lists
with up to 2500 and more values.  I'd like to pull out the relevant pieces and include them
in a new JUnit test as a way to verify that thing works for large IN lists when multi-probing
is in effect, as well.  That's still a forthcoming patch.  For the record, though, I ran the
repro on a database with 100,000 rows in it and a single IN-list of 2500 values, and everything
worked fine--even when multi-probing was in effect.  This is as I would expect since the code
to generate the IN-list is the same before and after my changes.

> 7) Generally speaking, I like the new convenience methods in Predicate.java,
>    but I do worry about whether we need to start being concerned about
>    overall efficiency in the optimizer. For example, consider a change like:
> 
>  			RelationalOperator relop = pred.getRelop();
>  
> -			if (relop != null)
> +			if (pred.isRelationalOpPredicate())
> 
>    since isRelationalOpPredicate may end up calling getRelop() two more times,
>    and since getRelop() calls getLeftOperand() twice and does an instanceof
>    check, we're incrementally adding a fair amount of code.

Great observation.

>    It's really important that the optimizer code be clear and easy to read,
>    so I'm not suggesting any drastic measures. I'm just sort of thinking
>    out loud about some concerns about how to write an optimizer which is
>    simultaneously both
>    - clear and easy to read and understand (this is the most important)
>    - yet also efficient to run

This is a very valid point.  I've been focusing on the first goal, but you're right, the above
change is perhaps too costly.  I'll look into remedying this with a follow-up patch...

Thank you again for these excellent review comments.  I really (really) appreciate the extra
pair of the eyes and the great suggestions.  As I said above, if any of the above answers
are unclear or unsatisfactory, please ask again--I'll try to explain whatever it is that may
not make sense.

PS I plan to commit the exec_v1.plan patch later today (Monday) unless I hear objections from
anyone...

> Some possible improvements to IN optimization
> ---------------------------------------------
>
>                 Key: DERBY-47
>                 URL: https://issues.apache.org/jira/browse/DERBY-47
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.0.2.0
>         Environment: all
>            Reporter: Sunitha Kambhampati
>         Assigned To: A B
>         Attachments: d47_engine_doNotCommit_v1.patch, d47_engine_doNotCommit_v1.stat,
d47_mp_addlTestCases.patch, d47_mp_CBO_MoAP_v1.patch, d47_mp_CBO_MoAP_v1.stat, d47_mp_codeGen_v1.patch,
d47_mp_codeGen_v1.stat, d47_mp_exec_v1.patch, d47_mp_exec_v1.stat, d47_mp_relOpPredCheck_v1.patch,
d47_mp_relOpPredCheck_v1.stat, derby-47-performance-data.txt, derby-47-performance-data.txt,
Derby47PerformanceTest.java, Derby47PerformanceTest.java, InListOperatorNode.java, QueryPlanUniqueIndexAndWordIndexOneTerm.txt,
QueryPlanUniqueIndexAndWordIndexTwoTerms.txt, QueryPlanUniqueIndexOnlyOneTerm.txt, QueryPlanUniqueIndexOnlyTwoTerms.txt,
readlocks.diff, readlocks_withContext.diff
>
>
> Consider a simple case of  - 
> A table tbl has 10000 rows, there is a primary key index on i1
> and the query in question is 
>  select * from tbl where i1 in (-1,100000)
> derby does a table scan of the entire table even though the "IN" list has only two values
and the comparison is on a field that has an index.
> Briefly looking at the code, it seems like we insert a between and use the IN list to
get the start and stop values for the scan. Thus the range of the values in the "IN" list
here plays an important role. 
> Thus if the query was changed to select * from tbl where i1 in (-1, 1), an index scan
would be chosen.
> It would be nice if we could do something clever in this case where there is clearly
an index on the field and the number of values in the IN list is known. Maybe use the rowcount
estimate and the IN list size to do some optimizations.  
> - consider the length of the "IN" list to do searches on the table.  ie use the IN list
values to do index key searches on the table,
> -or try to convert it to a join. Use the "IN" list values to create a temporary table
and do a join. It is most likely that the optimizer will choose the table with "IN" list here
as the outer table in the join and thus will do key searches on the larger table. 
> -------------------------------------------------------------------
> some query plans that I logged using derby.language.logQueryPlan=true for some similar
queries:
> Table has ascending values from 0 - 9999 for i1. primary key index on i1.
> GMT Thread[UT0,5,main] (XID = 19941), (SESSIONID = 0), select * from scanfixed where
i1 in (-1,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict ResultSet
(2):
> Number of opens = 1
> Rows seen = 10000
> Rows filtered = 9990
> restriction = true
> projection = false
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:          750.38
> 	optimizer estimated cost:         8579.46
> Source result set:
> 	Table Scan ResultSet for SCANFIXED at read committed isolation level using instantaneous
share row locking chosen by the optimizer
> 	Number of opens = 1
> 	Rows seen = 10000
> 	Rows filtered = 0
> 	Fetch Size = 16
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		next time in milliseconds/row = 0
> 	scan information: 
> 		Bit set of columns fetched=All
> 		Number of columns fetched=9
> 		Number of pages visited=417
> 		Number of rows qualified=10000
> 		Number of rows visited=10000
> 		Scan type=heap
> 		start position: 
> null		stop position: 
> null		qualifiers:
> Column[0][0] Id: 0
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
> Column[0][1] Id: 0
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
> 		optimizer estimated row count:          750.38
> 		optimizer estimated cost:         8579.46
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> l
> 2004-10-14 18:59:47.577 GMT Thread[UT0,5,main] (XID = 19216), (SESSIONID = 0), select
* from scanfixed where i1 in (9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict
ResultSet (3):
> Number of opens = 1
> Rows seen = 10
> Rows filtered = 0
> restriction = true
> projection = true
> 	constructor time (milliseconds) = 0
> 	open time (milliseconds) = 0
> 	next time (milliseconds) = 0
> 	close time (milliseconds) = 0
> 	restriction time (milliseconds) = 0
> 	projection time (milliseconds) = 0
> 	optimizer estimated row count:            4.80
> 	optimizer estimated cost:           39.53
> Source result set:
> 	Index Row to Base Row ResultSet for SCANFIXED:
> 	Number of opens = 1
> 	Rows seen = 10
> 	Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6, 7, 8}
> 		constructor time (milliseconds) = 0
> 		open time (milliseconds) = 0
> 		next time (milliseconds) = 0
> 		close time (milliseconds) = 0
> 		optimizer estimated row count:            4.80
> 		optimizer estimated cost:           39.53
> 		Index Scan ResultSet for SCANFIXED using index SCANFIXEDX at read committed isolation
level using instantaneous share row locking chosen by the optimizer
> 		Number of opens = 1
> 		Rows seen = 10
> 		Rows filtered = 0
> 		Fetch Size = 16
> 			constructor time (milliseconds) = 0
> 			open time (milliseconds) = 0
> 			next time (milliseconds) = 0
> 			close time (milliseconds) = 0
> 			next time in milliseconds/row = 0
> 		scan information: 
> 			Bit set of columns fetched=All
> 			Number of columns fetched=2
> 			Number of deleted rows visited=0
> 			Number of pages visited=2
> 			Number of rows qualified=10
> 			Number of rows visited=10
> 			Scan type=btree
> 			Tree height=2
> 			start position: 
> 	>= on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			stop position: 
> 	> on first 1 column(s).
> 	Ordered null semantics on the following columns: 
> 			qualifiers:
> None
> 			optimizer estimated row count:            4.80
> 			optimizer estimated cost:           39.53

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message