While trying to complete the Phase 4 work for DERBY-805 I've come across some
behavior in the optimizer's treatment of cost estimates that I don't fully
understand. If anyone out there can offer answers/feedback for any of the
following three questions, I'd appreciate it...

1) Multiplication by outer rows.

There are four classes (ProjectRestrictNode, UnionNode, JoinNode, and
FromTable) that implement the "optimizeIt()" method and, in that method, call
"getJoinStrategy().estimateCost(...)". One of the arguments to optimizeIt() is
an outerCost value which holds the estimated cost for FromTables left of the
current one for this join order. So if we have a FROM list that is [ FT1, FT2,
FT3 ] and the current FromTable (the one we're optimizing) is FT2, then
outerCost will hold the estimated cost for FT1. For the sake of this question,
assume FT1, FT2, and FT3 are instances of one of the above-mentioned classes.
outerCost is then used in two ways by these FTs: a) it's passed down to the
children nodes, where it's factored into the cost estimates of those children,
and b) it's passed into a call to estimate the cost of the current FT itself.
As an example, take UnionNode. The a) code is like this:
    leftResultSet = optimizeSource(
        optimizer,
        leftResultSet,
        getLeftOptPredicateList(),
        outerCost); // <==

    rightResultSet = optimizeSource(
        optimizer,
        rightResultSet,
        getRightOptPredicateList(),
        outerCost); // <==
The b) code is then:
    /*
    ** Get the cost of this result set in the context of the whole plan.
    */
    getCurrentAccessPath().
        getJoinStrategy().
            estimateCost(
                this,
                predList,
                (ConglomerateDescriptor) null,
                outerCost, // <==
                optimizer,
                costEstimate);
In both cases, the only part of outerCost that's actually used is the estimated
row count. In cases where we're costing a Nested Loop Join (either for the
current FT or for any of its children), we multiply the cost of the current FT
(or its children) by the number of outer rows, since we will in effect be
incurring the cost of the inner FT for each row of the outer FT.
All of that said, my question is this: why do we need to pass the outerCost down
to the children, where it will be multiplied by their costs, AND at the same
time pass it into the "getJoinStrategy().estimateCost()" method, where it will
be multiplied (again) by the cost of the current FT? It seems to me like
the correct thing to do here is to _not_ pass outerCost to the children and to
instead just let the current FT do the multiplication if necessary.
As a trivialized example, assume our FT2 is a UnionNode over two subqueries,
each of which just has a single table:
FT2 = (select * from T1) UNION (select * from T2)
And assume FT1 is an outer table with row count of 100. Then in the part a)
code above, we'll pass "100" down to the left child, where it will eventually
make it down to T1 and we'll multiply the cost of T1 by 100. Likewise we'll
pass "100" down to the right child and multiply the cost of T2 by 100:
T1 estimated cost: 100 * T1_cost
T2 estimated cost: 100 * T2_cost
Then we'll set the cost of the UnionNode to be the sum of the estimated costs of
its children:
FT2 estimated cost: (100 * T1_cost) + (100 * T2_cost)
That done, if the current join strategy for FT2 is NestedLoop, the call in part
b) above will pass "100" to NestedLoopJoinStrategy.estimateCost(), where it will
be multiplied by the cost of FT2:
FT2 estimatedCost: 100 * ((100 * T1_cost) + (100 * T2_cost))
In other words, we've ended up multiplying the cost of FT2 by the outer row
count *TWICE*.
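To make the arithmetic concrete, here is a tiny self-contained sketch of the double multiplication described above. The class and method names, and the per-scan costs for T1 and T2, are all made up for illustration; this is not Derby's actual API:

```java
public class DoubleMultiplySketch {

    // Models the cost computed for FT2 (a union of T1 and T2) under a
    // nested loop join, following the a) and b) steps described above.
    static double nestedLoopCost(double outerRows,
                                 double t1Cost, double t2Cost) {
        // a) outerCost is pushed down to each child of the union, so
        //    each child's cost is multiplied by the outer row count:
        double leftCost  = outerRows * t1Cost;
        double rightCost = outerRows * t2Cost;

        // UnionNode's cost is the sum of its children's costs:
        double unionCost = leftCost + rightCost;

        // b) NestedLoopJoinStrategy.estimateCost() then multiplies the
        //    union's cost by the outer row count AGAIN:
        return outerRows * unionCost;
    }

    public static void main(String[] args) {
        // FT1 has 100 rows; 2.0 and 3.0 are invented per-scan costs.
        double actual   = nestedLoopCost(100.0, 2.0, 3.0);
        // A single multiplication would instead give:
        double expected = 100.0 * (2.0 + 3.0);
        System.out.println(actual + " vs " + expected); // 50000.0 vs 500.0
    }
}
```

So with an outer row count of 100, the estimate comes out a factor of 100 larger than a single multiplication would produce.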
I guess this could be intentional in order to make sure we favor hash joins
(which won't do the second multiplication; see below) over nested loop joins,
but it'd be great if someone could confirm this one way or the other. As it is,
it seems a bit odd to me.

2) HashJoinStrategy.estimateCost()

As mentioned at the end of question 1, if the join strategy that we're currently
costing is a HashJoin and we make a call to HashJoinStrategy.estimateCost() (see
above for an example), we get here:
    /** @see JoinStrategy#estimateCost */
    public void estimateCost(Optimizable innerTable,
                             OptimizablePredicateList predList,
                             ConglomerateDescriptor cd,
                             CostEstimate outerCost,
                             Optimizer optimizer,
                             CostEstimate costEstimate)
    {
        /*
        ** The cost of a hash join is the cost of building the hash table.
        ** There is no extra cost per outer row.
        */
    }
I.e. we don't do anything. But at the same time, if we're costing a
FromBaseTable and considering a HashJoin, we _do_ actually use the outer row
count. See FromBaseTable.estimateCost():
    /*
    ** Let the join strategy decide whether the cost of the base
    ** scan is a single scan, or a scan per outer row.
    ** NOTE: In this case we only want to multiply against the
    ** total row count, not the singleScanRowCount.
    */
    if (currentJoinStrategy.multiplyBaseCostByOuterRows())
    {
        newCost *= outerCost.rowCount();
    }

    rowCount *= outerCost.rowCount();
For a HashJoin the call to "multiplyBaseCostByOuterRows()" will return false.
Notice though that while we will not multiply the current cost (newCost) by the
outer row count for a hash join, we _will_ multiply the current _rowCount_ by
the outer row count.
So my question is this: why is it that, for a FromBaseTable, we multiply the
current rowCount by the outer row count for a Hash join (as per
FromBaseTable.estimateCost()), but for other (non-base-table) FromTables (such
as UnionNode) we skip this multiplication (as per
HashJoinStrategy.estimateCost())? Is this an intentional difference, and if so,
does anyone know why?
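The FromBaseTable behavior quoted above boils down to a small piece of arithmetic, which this sketch isolates (the method and numbers are invented for illustration; only the two multiplications mirror the quoted code):

```java
public class BaseTableCostSketch {

    // Mirrors the quoted fragment of FromBaseTable.estimateCost():
    // the base cost is multiplied by the outer row count only if the
    // join strategy says so, but the row count is multiplied either way.
    // Returns { newCost, rowCount }.
    static double[] estimate(boolean multiplyBaseCostByOuterRows,
                             double newCost, double rowCount,
                             double outerRows) {
        if (multiplyBaseCostByOuterRows) {
            newCost *= outerRows; // nested loop: one base scan per outer row
        }
        rowCount *= outerRows;    // done for BOTH join strategies
        return new double[] { newCost, rowCount };
    }

    public static void main(String[] args) {
        // Invented inputs: base cost 10, base row count 50, 100 outer rows.
        double[] hash = estimate(false, 10.0, 50.0, 100.0); // hash join
        double[] nl   = estimate(true,  10.0, 50.0, 100.0); // nested loop
        System.out.println("hash: cost=" + hash[0] + " rows=" + hash[1]);
        System.out.println("nl:   cost=" + nl[0]   + " rows=" + nl[1]);
    }
}
```

For the hash join the cost stays at 10.0 while the row count still becomes 5000.0, which is exactly the asymmetry the question is about: for a non-base-table FromTable, HashJoinStrategy.estimateCost() performs neither multiplication.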

3) Comparing plan costs.

In OptimizerImpl.java there is the following code for comparing "current" cost
with "best cost" for this round of optimization:
    /*
    ** Is the cost of this join order lower than the best one we've
    ** found so far?
    **
    ** NOTE: If the user has specified a join order, it will be the
    ** only join order the optimizer considers, so it is OK to use
    ** costing to decide that it is the "best" join order.
    */
    if ((! foundABestPlan) || currentCost.compare(bestCost) < 0)
    {
        rememberBestCost(currentCost, Optimizer.NORMAL_PLAN);

        // Since we just remembered all of the best plans,
        // no need to reload them when pulling Optimizables
        // from this join order.
        reloadBestPlan = false;
    }
The call to "currentCost.compare()" only compares the estimatedCost fields; it
does not look at the other two fields of a CostEstimate, namely rowCount and
singleScanRowCount.
In running some tests against 10.1, though, I noticed that it is possible for
currentCost and bestCost to have the same estimatedCost values but different row
counts. Since the above check does not look at row counts, we can end up in
situations where the following two queries can result in different query plans:
select * from V1, V2 where V1.j = V2.b;
select * from V2, V1 where V1.j = V2.b;
I.e. the queries are identical with the exception of the order of V1 and V2. In
my particular example V1 and V2 are both unions of two tables, where V2 has
100,000 rows and V1 has less than ten. When I ran these queries against 10.1
with NO optimizer timeout (meaning that the optimizer had time to examine all
possible query plans) the result was surprising: the first query chose join
order [ V2, V1 ] and took 4.5 to 5 seconds to complete; the second query chose
join order [ V1, V2 ] and took 30 to 33 seconds to complete. Wow.
I made a small change to the CostEstimateImpl.compare() method to consider
rowCount as a secondary field of comparison (if the estimatedCosts are the same)
and singleScanRowCount as a third field of comparison (if the rowCounts are
the same). With that change the behavior was then what I expected: namely, the
optimizer chose the same join order (the faster one) regardless of which query I
ran.
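The kind of tie-breaking comparison I mean looks roughly like this (a simplified sketch; the real CostEstimateImpl.compare() takes a CostEstimate argument, not an array, and my actual patch may differ in details):

```java
public class CompareSketch {

    // a and b each hold { estimatedCost, rowCount, singleScanRowCount }.
    // estimatedCost is still the primary key of comparison; rowCount and
    // singleScanRowCount only break ties, in that order.
    static int compare(double[] a, double[] b) {
        if (a[0] != b[0]) return (a[0] < b[0]) ? -1 : 1; // estimatedCost
        if (a[1] != b[1]) return (a[1] < b[1]) ? -1 : 1; // rowCount
        if (a[2] != b[2]) return (a[2] < b[2]) ? -1 : 1; // singleScanRowCount
        return 0;
    }

    public static void main(String[] args) {
        // Same estimatedCost (500), but "current" expects far fewer rows,
        // so with the tie-breaker it now wins the comparison.
        double[] current = { 500.0, 10.0, 5.0 };
        double[] best    = { 500.0, 100000.0, 5.0 };
        System.out.println(compare(current, best)); // prints -1
    }
}
```

With the original compare(), these two estimates would have compared equal, so whichever join order happened to be costed first would be kept.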
So my question is this: if two cost estimates have the same estimatedCost but
different rowCounts (or even singleScanRowCounts), is it reasonable to suppose
that the estimate with the lower rowCount is actually better? In the above case
the query with the lower row count was the one that ran more quickly; is it
reasonable to believe that this will generally be the case?

As always, thanks to anyone with the time and inclination to read this email and
provide feedback...
Army
