db-derby-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Db-derby Wiki] Update of "PermutationCosting" by Army
Date Tue, 31 Oct 2006 17:49:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Db-derby Wiki" for change notification.

The following page has been changed by Army:

New page:
As described on the [http://wiki.apache.org/db-derby/LanguageOptimize Optimizer Overview]
page, there are three primary methods in Derby's !OptimizerImpl class that encapsulate the
general process of trying to find the "best join order" for a given !OptimizerImpl's list
of Optimizables. Those three methods are:

  - getNextPermutation()

  - getNextDecoratedPermutation()

  - costPermutation()

This page describes the general flow of code as it occurs when a class such as !SelectNode
or !TableOperatorNode calls the costPermutation() method of an !OptimizerImpl.  The expectation
is that such a call will occur as part of a block of code similar to the following (copied
from !SelectNode.optimize()):

/* Optimize this SelectNode */
while (optimizer.getNextPermutation())
    while (optimizer.getNextDecoratedPermutation())

Thus when we reach {{{ optimizer.costPermutation() }}} we know that:

  * !OptimizerImpl has an array of Optimizables representing a (potentially incomplete) join
  * an Optimizable has just been placed at some position in the join order array, and
  * a specific "decoration", or access path, has been chosen for the most-recently-placed
Optimizable (hereafter referred to as "MRP-Optimizable").

At this point, then, what we want to do is estimate the cost of having MRP-Optimizable at
its current position in the join order with the current choice of join strategy and, if applicable,
the current choice of index/heap.

A look at the code in {{{ OptimizerImpl.costPermutation() }}} shows that it does three things.
 First it figures out what the "outerCost" for MRP-Optimizable should be.  Then it checks
to see if the current choice of join strategy is feasible for MRP-Optimizable.  And finally,
it makes a call to the "optimizeIt()" method of MRP-Optimizable, which is where the cost estimate
is calculated.

=== Finding the "outer cost" ===

To start with, the "outer cost" for a given MRP-Optimizable is implicitly defined to be the
cost of the left side of the join for which MRP-Optimizable is the right side.  This follows
from the fact that Derby only considers left-deep trees when estimating costs.  So if, for
example, we have a join order that is {T2, T1, T4} and T4 is our MRP-Optimizable, the join
tree looks like:

     /      \
   JOIN[1]   T4
  /    \
 T2     T1

and the "outer cost" for T4 is simply the cost of JOIN[1].  In addition to this definition
of "outer cost", the code in !OptimizerImpl implicitly defines the "estimated cost" of an
Optimizable Opt[n] as follows:

Let "n" be a zero-based position in the join order and let Opt[n] be the Optimizable that
is placed at position "n".  The "estimated cost" of Opt[n] is:

  * if {{{ (n == 0) }}} then it is simply the cost of accessing the rows in Opt[n]
  * else, it is the cost of accessing the rows in Opt[n] and joining them with the rows from
Opt[n-1] using whatever join strategy is chosen as part of the "access path" for Opt[n].

So to continue our example, the "estimated cost" of T1 is meant to describe the cost of joining
T2 with T1 using whatever join strategy is chosen as part of T1's access path.  Similarly,
the estimated cost of T4 is the cost of joining T4 with the result of the join between T2
and T1, using whatever join strategy is chosen as part of T4's access path.

What this means is that, according the definitions of "outer cost" and "estimated cost", the
"outer cost" for a given Optimizable is simply the cost of the preceding Optimizable in the
join order, or 1.0d if the Optimizable is the first in the join order (i.e. {{{ n == 0 }}}).
 Thus the outer cost of T2 is 1.0d, the outer cost of T1 is the estimated cost of T2, and
the outer cost of T4 is the estimated cost of JOIN[1], which is really just the estimated
cost of T1 (according to the definition of "estimated cost").

In the actual code for !OptimizerImpl.costPermutation(), this logic is encapsulated by the

** Get the cost of the outer plan so far.  This gives us the current
** estimated rows, ordering, etc.
CostEstimate outerCost;
if (joinPosition == 0)
    outerCost = outermostCostEstimate;
    outerCost =
            proposedJoinOrder[joinPosition - 1]).

=== Join Strategy Feasibility ===

After determining the outer cost for MRP-Optimizable, the second thing !OptimizerImpl does
as part of its "costPermutation()" method is check to see that the join strategy for the current
access path is feasible:

** Don't consider non-feasible join strategies.
if ( ! optimizable.feasibleJoinStrategy(predicateList, this))

By this time "optimizable", which is our MRP-Optimizable, has some associated access path
for which we want to estimate the cost, and that access path includes a choice of join strategy.
 Derby currently only supports two join strategies, nested loop and hash, represented by the
classes !NestedLoopJoinStrategy.java and !HashJoinStrategy.java, respectively.

The call to {{{ optimizable.feasibleJoinStrategy() }}} will ultimately result in a call to
the "feasible()" method of the join strategy that is part of MRP-Optimizable's current access
path.  The !JoinStrategy class then determines whether or not the join that it represents
is feasible for MRP-Optimizable.  As a general rule, a nested loop join is always feasible
(though there are some corner-cases; see the code comments in {{{ NestedLoopJoinStrategy.feasible()
}}}).  A hash join is only feasible if there exists an equijoin predicate between MRP-Optimizable
and some other Optimizable that precedes MRP-Optimizable in the join order.  See HashJoinsInDerby
for more details on hash join processing during Derby optimization.

=== Calculating a Cost Estimate ===

After finding the "outer cost" for MRP-Optimizable and verifying that the current join strategy
is feasible, !OptimizerImpl then makes the call to estimate the cost of MRP-Optimizable at
its current position in the join order with its current access path ("decoration"):

/* Cost the optimizable at the current join position */

This is the last line of the !OptimizerImpl.costPermutation() method.

The thing to note here is that !OptimizerImpl does not directly compute the estimated cost
of the Optimizable.  Instead, the Optimizable is responsible for calculating its *own* estimated
cost based on the information passed in from the !OptimizerImpl to which the it belongs (i.e.
using the arguments passed from the code shown above).  An !OptimizerImpl then takes the estimated
cost calculated by the Optimizable and uses that estimate to calculate the overall estimated
cost for the join order in question. That said, it is worth mentioning that the above code
does not retrieve or save the actual cost estimate anywhere.  This is because the Optimizable
is also responsible for saving its own estimated cost, along with saving all information that
describes the access path corresponding to the estimated cost.  An !OptimizerImpl may ask
for the estimated cost or access path information for a specific Optimizable, but that information
is *not* stored within the !OptimizerImpl.  This
  is why we see things like:

outerCost =
        proposedJoinOrder[joinPosition - 1]).

in many of the !OptimizerImpl's methods.  In this example the !OptimizerImpl wants to know
what the best cost estimate for a specific Optimizable is.  In order to get that information
the !OptimizerImpl has to retrieve the target Optimizable from its list and specifically "ask"
that Optimizable for the estimated cost of its best access path so far.  That's what the above
code snippet does.

Okay, so backing up a step, we said that an Optimizable is responsible for calculating its
own estimated cost. As part of this responsibility the Optimizable must also ensure that it
makes the necessary call(s) to its child(ren) result set(s) so that they, in turn, can calculate
their estimated costs, as well.  Depending on the type of of the result set, this usually
means that the Optimizable will call one of "optimize()" or "optimizeIt()" on every child
result set that it has. This effectively means that a call to {{{ optimizable.optimizeIt()
}}} from !OptimizerImpl will yield a cost estimate for the entire subtree whose root is the
Optimizable in question (i.e. MRP-Optimizable).

As an example, assume MRP-Optimizable is a UNION between two SELECT queries.  Then as part
of its "optimizeIt()" method the UNION must make a call to the "optimize()" or "optimizeIt()"
method of both its left child AND its right child.  The UNION will then calculate its own
estimated cost based on the costs of its children, and that final cost is what the !UnionNode
will save for its current access path. We can get a glimpse of how this happens by looking
at the following code snippet, pulled from the {{{ UnionNode.optimizeIt() }}} method:

leftResultSet = optimizeSource(

rightResultSet = optimizeSource(

CostEstimate costEstimate = getCostEstimate(optimizer);

/* The cost is the sum of the two child costs */
                     leftResultSet.getCostEstimate().singleScanRowCount() +

costEstimate.add(rightResultSet.costEstimate, costEstimate);

In this case the "optimizeSource()" method comes from !UnionNode's parent class, !TableOperatorNode,
in which we see one of two things, depending on the class of child nodes in question.  For
a child that is itself an Optimizable, the "optimizeSource()" method creates a new instance
of !OptimizerImpl and enters a loop that is identical to the while loop shown at the top of
this page, thereby leading to recursive optimization.  For a child that is not an Optimizable
the "optimizeSource()" method makes the following call:

retval = sourceResultSet.optimize(

and thereby fulfills !UnionNode's job of calling "optimize()" or "optimizeIt()" on every child.

What all of this means is that there is no central place in the Derby code where all calculation
of estimated cost occurs.  Just as every result set node in the tree has to define its own
"preprocess()" and "modifyAccessPaths()" methods (see [http://wiki.apache.org/db-derby/LanguageOptimize
Optimizer Overview]), every result set has to define--either directly or through inheritance--its
own "optimizeIt()" or "optimize()" method, as well.  Depending on the type of result set in
question, the amount and kind of work that is done to estimate the cost can vary greatly.

That said, though, the "leaves" of most query trees are most often made up of base tables.
 For this reason it is worth looking more at the code flow that occurs for a base table in
Derby, which is represented by an instance of !FromBaseTable.  In {{{ FromBaseTable.optimizeIt()
}}} we see the following:


In other words, a !FromBaseTable actually calls back to the !OptimizerImpl to which it belongs.
 At first glance this may seem to contradict the notion of every Optimizable calculating its
own estimated cost. If we keep going, though, we see that the "costOptimizable()" method of
!OptimizerImpl goes through several layers of method calls and ultimately ends up at the private
method "costBasedCostOptimizable()" in the same class.  In that method the !OptimizerImpl
does two things:

  1. it first calls "estimateTotalCost()", which in turn calls the "estimateCost()" method
of the Optimizable in question--i.e. it calls ''*back*'' to the !FromBaseTable for calculation
of the estimated cost. In this way the calculation is again left up to the Optimizable, as

  1. it then determines if the result of the call to "estimateCost()" on the Optimizable is
the best one for that Optimizable so far.  If so, the !OptimizerImpl will tell the Optimizable
to save the current cost (and the associated access path) as the best so far.

Given this code flow, it is not surprising that for most queries, the bulk of the low-level
costing information and logic for Derby originates in the {{{ FromBaseTable.estimateCost()
}}} method.  The cost estimates returned from that method are then passed back up the query
tree to the various result set nodes that rely on it. So in the case of a UNION with two SELECT
queries, the cost of each SELECT query is built on the estimated costs returned for the base
tables in each query; the cost of the UNION is then build on the costs of the SELECT queries.

Eventually we end up back where we started in !OptimizerImpl--namely, back at:

/* Cost the optimizable at the current join position */

Having completed this call, "optimizable"--which is our MRP-Optimizable--now knows what the
estimated cost for its current access path is.  We then return from the {{{ optimizer.costPermutation()
}}} method shown in the "while" loop at the top of this page, and continue processing as described
[http://wiki.apache.org/db-derby/LanguageOptimize here].

View raw message