db-derby-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Db-derby Wiki] Update of "DecoratedPermutations" by Army
Date Tue, 07 Nov 2006 22:48:05 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Db-derby Wiki" for change notification.

The following page has been changed by Army:
http://wiki.apache.org/db-derby/DecoratedPermutations

New page:
As described on the [http://wiki.apache.org/db-derby/LanguageOptimize Optimizer Overview]
page, there are three primary methods in Derby's !OptimizerImpl class that encapsulate the
general process of trying to find the "best join order" for a given !OptimizerImpl's list
of Optimizables. Those three methods are:

  - getNextPermutation()

  - getNextDecoratedPermutation()

  - costPermutation()

This page describes the general flow of code as it occurs when a class such as !SelectNode
or !TableOperatorNode calls the getNextDecoratedPermutation() method of an !OptimizerImpl.
 The expectation is that such a call will occur as part of a block of code similar to the
following (copied from !SelectNode.optimize()):

{{{
/* Optimize this SelectNode */
while (optimizer.getNextPermutation())
{
    while (optimizer.getNextDecoratedPermutation())
    {
        optimizer.costPermutation();
    }
}
}}}

Thus when we reach {{{optimizer.getNextDecoratedPermutation()}}} we know that:

  * !OptimizerImpl has an array of Optimizables representing a (potentially incomplete) join
order, and
  * an Optimizable has just been placed at some position in the join order array (see JoinOrderPermutations).

The job of getNextDecoratedPermutation(), then, is to find the next "decoration" for the Optimizable
that was most recently placed in {{{optimizer}}}'s join order.  Note that a "decorated permutation"
is also referred to as an "access path" for an Optimizable. An access path has three primary
"fields" in Derby:

  1. a conglomerate
  1. a join strategy
  1. a cost estimate

If the Optimizable is a base table, then its access path determines which (if any) index to
use and also indicates which join strategy is most appropriate for that table. Or put another
way, both the "conglomerate" and the "join strategy" fields are non-null. If, on the other
hand, the Optimizable is not a base table (for example, if it is a union or a subquery) then
the ''logical'' access path consists of the combined access paths from the Optimizable's children
result sets, plus the best join strategy for that Optimizable.  That said, such an Optimizable's
"access path" will only have an assigned join strategy; the conglomerate field will be null
since conglomerates only exist for base tables.

As can be seen by the inner loop in the above the code fragment, the idea is to loop through
all access paths for the Optimizable in question, finding a cost estimate for each such access
path.  So in the case of a !FromBaseTable the call to {{{optimizer.getNextDecoratedPermutation()}}}
will return "true" {{{(j*(ix+1))}}} times, where {{{j}}} is the number of join strategies
that Derby supports and {{{ix}}} is the number of indexes that exist on the !FromBaseTable.
 For an Optimizable that is not a !FromBaseTable, the call to {{{optimizer.getNextDecoratedPermutation()}}}
will return "true" {{{j}}} times--once for each of Derby's join strategies. Note: Currently
Derby only supports two join strategies--nested loop and [http://wiki.apache.org/db-derby/HashJoinsInDerby
hash]--so {{{j}}} will always be two (2).

All of that said, each call to {{{OptimizerImpl.getNextDecoratedPermutation()}}} does the
following four tasks:

  1. Find the next available access path ("decoration") for the Optimizable in question.
  1. Ensure that the cheapest access path found for the Optimizable so far (i.e. since entering
the inner {{{while}}} loop in the above code snippet) is stored in the Optimizable's "best
access path" field before returning.
  1. Take note of the "running total" cost estimate for the current join order.
  1. If the Optimizable is placed at the last position in !OptimizerImpl's join order AND
if there are no more access paths for the Optimizable, then check to see if the cost of the
'''complete''' join order is the lowest so far. If so, "remember" that join order as the best
one for the !OptimizerImpl.

== Task 1: Next Available Access Path ==

The first of the tasks mentiond above--i.e. finding the next access path, or "decoration",
for an Optimizable--is the one which most directly correlates to the name of the method itself.
 It is perhaps a tad surprising, then, that this task is accomplished via a single line in
the getNextDecoratedPermutation() method, namely:

{{{
/* Returns true until all access paths are exhausted */
retval =  curOpt.nextAccessPath(this,
                        (OptimizablePredicateList) null,
                        currentRowOrdering);
}}}

In other words, just as every Optimizable in an !OptimizerImpl's list is responsible for calculating
its own [http://wiki.apache.org/db-derby/PermutationCosting estimated cost], every Optimizable
is also responsible for finding out what its next access path is.  Or more specifically, every
Optimizable must define--either directly or indirectly through inheritance--the "nextAccessPath()"
method.

Unlike estimating a cost, though, the process of finding a "next access path" is fairly uniform
from one Optimizable to the next.  This follows from the fact that there are really just two
cases to consider: the case of a base table (represented by a !FromBaseTable in Derby) and
the case of a non-base table (any Optimizable that is not a !FromBaseTable).  It is not surprising,
then, that there are only two classes in Derby that actually have a {{{nextAccessPath()}}}
method defined:

  * !FromBaseTable
  * !FromTable (parent to all Optimizables)

Well okay, that's not strictly true.  The !ProjectRestrictNode class also defines a {{{nextAccessPath()}}}
method.  However,  all that method does is call the method of the same name on the !ProjectRestrictNode's
child (if the child is an instance of Optimizable) or else on the !ProjectRestrictNode's superclass
(which will end up being FromTable).  So in the end, we always come back to the {{{nextAccessPath()}}}
method as defined in either !FromBaseTable or !FromTable.

That said, {{{FromTable.nextAccessPath()}}} is where we iterate through the different join
strategies supported by Derby.  As of Derby 10.2 a user can use "optimizer overrides" to specify
what join strategy the Optimizer should use for a specific Optimizable.  Thus the code in
!FromTable is such that:

  * If the user specified a join strategy, the first call to nextAccessPath() will set the
"join strategy" field of the Optimizable's current access path to the strategy specfied by
the user, and then will return "true"; the second call will return "false".
  * otherwise, the first call to nextAccessPath() will set the "join strategy" field of the
Optimizable's current access path to the first join strategy in the list of strategies supported
by Derby, and then will return "true".  Subsequent calls will iterate through the remaining
strategies one-by-one until there are no more, in which case the method will return "false".

The current list of join strategies supported by Derby can be found in the "getOptimizer()"
method of !OptimizerFactoryImpl.java, of which the relevant lines are shown here:

{{{
/* Get/set up the array of join strategies.
 */
if (joinStrategySet == null)
{
    JoinStrategy[] jss = new JoinStrategy[2];
    jss[0] = new NestedLoopJoinStrategy();
    jss[1] = new HashJoinStrategy();
    joinStrategySet = jss;
}
}}}

>>From this we can see that, as stated above, Derby only supports two join strategies
today: !NestedLoop and Hash.

Similarly, the {{{FromBaseTable.nextAccesPath()}}} method is where we iterate through the
indexes for a specific base table.  As with join strategies, Derby 10.2 and later allows the
user to optionally specify a particular index to use for a base table.  Thus the code in !FromBaseTable
is such that:

  * If the user specified a specific index to use, the first call to nextAccessPath() will
set the "conglomerate" field of the Optimizable's current access path so that it corresponds
to the index specfied by the user, and then will return "true"; the second call will return
"false".
  * otherwise, the first call to nextAccessPath() will set the "conglomerate" field of the
Optimizable's current access path to the first conglomerate that Derby knows about for the
base table in question, and then will return "true". Subsequent calls will iterate through
the remaining conglomerates one-by-one until there are no more, in which case the method will
return "false".

Note that every base table has at least one "heap" conglomerate representing a table scan.
 So if no indexes are present on a base table, nextAccessPath() will select the heap conglomerate.

== Task 2: An Optimizable's "Best Access Path" ==

If we again look at the loop shown at the top of this page, we can see that the looped calls
to getNextDecoratedPermutation() are separated by calls to {{{optimizer.costPermutation()}}}.
 The [http://wiki.apache.org/db-derby/PermutationCosting costPermutation()] method effectively
tells the Optimizable to estimate its cost given its current position in the join order and
its current access path.  Upon completion of that call the "cost estimate" field of the Optimizable's
current access path will hold the estimated cost as calculated by costPermutation().  In addition,
the Optimizable's "best access path" will contain the conglomerate (if applicable), the join
strategy, and the estimated cost for whatever the '''best''' access path is so far.  If the
current access path is also the best access path, then the "current" and "best" access paths
for the Optimizable will have the same conglomerate, join strategy, and cost estimate.

So how do we keep track of "best access path" for an Optimizable?  As described [http://wiki.apache.org/db-derby/PermutationCosting
here], a call to the {{{OptimizerImpl.costPermutation()}}} method will eventually result in
a call to the "optimizeIt()" method of the Optimizable in question.  As it turns out, every
call to optimizeIt() will, after calculating the estimated cost, make a call to one of the
following two methods defined in !OptimizerImpl:
 
  * considerCost() -- for non-base tables
  * costBasedCostOptimizable() -- for base tables

Both of these methods contain code to compare two access paths--namely, "current access path"
and "best access path" for the Optimizable in question.  More specifically, the methods check
three things:

  1. Is the current access path's join strategy feasible with the current Optimizable?
  1. Can the current access path execute within the memory constraints that we have?
  1. Is the current access path cheaper than the "best" path found so far?

If the answer to all three of these questions is "Yes", then the current access path is saved
as the "best access path" for the Optimizable.

But we cannot stop there.  We said above that the ''logical'' access path for an Optimizable
that is not a base table (ex. if it is a union or a subquery) consists of the combined access
paths from the Optimizable's children result sets, plus the best join strategy for that Optimizable.
 So in the case of nested subqueries, for example, it is not enough to just compare "current"
and "best" access paths for the Optimizable itself.  If we let the Optimizable in question
be {{{OPT_0}}}, then we also have to ensure that the "best access path" for all of {{{OPT_0}}}'s
children result sets are correct, as well.  Or put another way, we have to ensure that all
Optimizables in the subtree '''beneath''' {{{OPT_0}}} correctly correlate with the "best access
path" found for {{{OPT_0}}}.  This is the second task for which getNextDecoratedPermutation()
is responsible.

=== Justification for Task 2 ===

Before describing more about this particular task, let's look at an example that demonstrates
why it is not good enough to just call considerCost() or costBasedCostOptimizable().  Assume
we have the following query:

{{{
select * from
    (select * from t1 union select * from t2) X1,
    (select * from t3 union select * from t4) X2
where X1.j = X2.b;
}}}

Trimming out a vast amount of detail (much of which is already covered in this and other wiki
pages), if we were to skip over the second task for getNextDecoratedPermutation() then we
might see the following INCORRECT scenario.  Note that cost estimates here don't mean anything,
they're just numbers used to show relative costs.

   1. Outer-most !OptimizerImpl picks join order {X1, X2}.
   1. First permutation for X2 is "nested loop join", so we push predicates to the base tables.
   1. We optimize X2 assuming nested loop join and pushed predicates, and we get a cost estimate
of 50 using index scans on T3 and T4.  So "best path" for T3 and T4 is now "index scan" with
an associated cost.
   1. We call X2.considerCost() which saves "nested loop join with cost 50" as the best path
for X2.
   1. Second permutation for X2 is "hash join"; so we pull predicates back up.
   1. We optimize X2 assuming hash join and no pushed predicates, and we get a cost of 75
using table scans on T3 and T4 (because the predicates were not pushed).  So "best path" for
T3 and T4 is now "table scan" with an associated cost.
   1. We call X2.considerCost() which sees that 75 is worse than 50 and so does _not_ change
X2's best path.  Thus X2's best path is still "nested loop join with cost 50".  '''But'''
we do not walk X2's subtree to revert the "best paths" of its children, and thus T3 and T4's
best paths are still "table scan".
   1. X2 returns "nested loop join with cost 50" as its "best path" to the outer !OptimizerImpl.
   1. Outer !OptimizerImpl doesn't have a "best" join order yet, so it saves "X1, X2" as it's
best join order, which means we will walk the subtree and save all "best paths" as "truly
the best paths" for X1 and X2 (see "Task 4" below).  This means that we'll save "nested loop
join" for X2, "table scan" for T3 and "table scan" for T4.

For simplicity let's say that we've finished optimization at this point.  We'll then generate
a query plan that does a nested loop join between X1 and X2 and even pushes the predicates
down to T3 and T4, which is good.  *However*, instead of using the predicates to do index
scans, which is what we wanted to do, we'll end up doing table scans.  We will still use the
predicates as qualifiers for the table scan, but that's not what the !OptimizerImpl for X2
was intending--we were supposed to use the predicates to do an index scan.  This mix-up comes
from the fact that we didn't properly revert the plans of the nodes beneath X2 when X2.considerCost()
rejected the hash join.  

=== Task 2: Some Details ===

In order to avoid the incorrect behavior that we just described, we have to have logic to
save the "best paths" for an Optimizable's subtree before each new permutation.  Then if the
current permutation isn't the best one, we need to revert the entire subtree's paths back
to what they were for the "best" permutation.

As mentioned above, the "considerCost()" method is where we decide if the current access path
for a non-base table is the best one so far.  Thus it is also the method around which we need
to add logic for saving/reverting access paths.  As it turns out, there are only four classes
in the Derby code that call !OptimizerImpl.considerCost(): !FromTable, !ProjectRestrictNode,
!JoinNode, and !UnionNode.  In each case, the call to considerCost() comes from within the
"optimizeIt()" method.  So these are the four places where we need to save an Optimizable's
access path before trying out a new permutation.  As an example we can see the following in
!UnionNode.java:

{{{
    /* It's possible that a call to optimize the left/right will cause
     * a new "truly the best" plan to be stored in the underlying base
     * tables.  If that happens and then we decide to skip that plan
     * (which we might do if the call to "considerCost()" below decides
     * the current path is infeasible or not the best) we need to be
     * able to revert back to the "truly the best" plans that we had
     * saved before we got here.  So with this next call we save the
     * current plans using "this" node as the key.  If needed, we'll
     * then make the call to revert the plans in OptimizerImpl's
     * getNextDecoratedPermutation() method.
     */
    updateBestPlanMap(ADD_PLAN, this);
}}}

There is similar code in the other three classes, as well.  Then, as explained in the comment,
the logic to revert the plans (if needed) is found in the getNextDecoratedPermutation() method.
 So the second task of that method is accomplished with the following code:

{{{
/* If the previous path that we considered for curOpt was _not_ the best
 * path for this round, then we need to revert back to whatever the
 * best plan for curOpt was this round.  Note that the cost estimate
 * for bestAccessPath could be null here if the last path that we
 * checked was the only one possible for this round.
 */
if ((curOpt.getBestAccessPath().getCostEstimate() != null) &&
    (curOpt.getCurrentAccessPath().getCostEstimate() != null))
{
    /* Note: we can't just check to see if bestCost is cheaper
     * than currentCost because it's possible that currentCost
     * is actually cheaper--but it may have been 'rejected' because
     * it would have required too much memory.  So we just check
     * to see if bestCost and currentCost are different.  If so
     * then we know that the most recent access path (represented
     * by "current" access path) was not the best.
     */
    if (curOpt.getBestAccessPath().getCostEstimate().compare(
        curOpt.getCurrentAccessPath().getCostEstimate()) != 0)
    {
        curOpt.updateBestPlanMap(FromTable.LOAD_PLAN, curOpt);
    }
    else if (curOpt.getBestAccessPath().getCostEstimate().rowCount() <
        curOpt.getCurrentAccessPath().getCostEstimate().rowCount())
    {
        /* If currentCost and bestCost have the same cost estimate
         * but currentCost has been rejected because of memory, we
         * still need to revert the plans.  In this case the row
         * count for currentCost will be greater than the row count
         * for bestCost, so that's what we just checked.
         */
        curOpt.updateBestPlanMap(FromTable.LOAD_PLAN, curOpt);
    }
}
}}}

Let's again look at the scenario for the example given above, but this time we want to take
into account the "Task 2" code..  Here is the query again:

{{{
select * from
    (select * from t1 union select * from t2) X1,
    (select * from t3 union select * from t4) X2
where X1.j = X2.b;
}}}

The resultant scenario is as follows.  Note that lines preceded by "++" are the result of
the {{{updateBestPlanMap(ADD_PLAN, this)}}} code mentioned above.

  1. Outer-most !OptimizerImpl picks join order {X1, X2}.
  1. First permutation for X2 is "nested loop join", so we push predicates to the base tables.
  1. ++ We save the current "best paths" for X2, T3, and T4--since this is our first time
optimizing, we don't have a best path yet, so this is a no-op.
  1. We optimize X2 assuming nested loop join and pushed predicates, and we get a cost estimate
of 50 using index scans on T3 and T4.  So "best path" for T3 and T4 is now "index scan" with
an associated cost.
  1. We call X2.considerCost() which saves "nested loop join with cost 50" as the best path
for X2.
  1. Second permutation for X2 is "hash join"; so we pull predicates back up.
  1. ++ We save the current "best paths" (from the previous permutation) for X2, T3, and T4--so
X2 is "nested loop", T3 is "index scan", and T4 is "index scan".
  1. We optimize X2 assuming hash join and no pushed predicats, and we get a cost of 75 using
table scans on T3 and T4 (because the predicates were not pushed).  So "best path" for T3
and T4 is now "table scan" with an associated cost.
  1. We call X2.considerCost() which sees that 75 is worse than 50 and so does _not_ change
X2's best path.  Thus X2's best path is still "nested loop join with cost 50".
  1. ('''This step is the result of the "Task 2" code.''') Since "hash join" was the most
recently optimized permutation, we'll check to see if it yielded the best access path.  Since
the best path gave cost 50 and the "hash join" gave cost 75, we see that the latest permutation
("hash join") did _not_ yield the best cost.  So we revert the plans of X2 AND T3 AND T4 to
the values that we saved in step 7.  Thus X2 reverts to "nested loop", T3 reverts to "index
scan", and T4 reverts to "index scan".
  1. X2 returns "nested loop join with cost 50" to the outermost Optimizable.
  1. Outermost !OptimizerImpl doesn't have a "best" join order yet, so it saves "X1, X2" as
it's best join order, which means we will walk the subtree and save all "best paths" as "truly
the best paths" for X1 and X2 (see "Task 4" below).  This means that we'll save "nested loop
join" for X2, "index scan" for T3 and "index scan" for T4.

If we again assume that we're done optimizing, we will now generate the desired plan--namely,
a nested loop join between X1 and X2 with index scans (using the pushed predicates) on T3
and T4.

== Task 3: A Join Order's Running Total Cost ==

The first two tasks performed by getNextDecoratedPermutation() deal with the most-recently-placed
Optimizable in the join order, as placed by a call to [http://wiki.apache.org/db-derby/JoinOrderPermutations
getNextPermutation()]. The third and fourth tasks, though, deal more with the !OptimizerImpl
itself.

The third task for which getNextDecoratedPermutation() is responsible is that of adding the
cost of an Optimizable's "best access path" to the !OptimizerImpl's running total for the
current join order.  This happens after all of the decorations for the most-recently-placed
Optimizable have been exhausted, which in turn means the best of those decorations is stored
in the Optimizable's "best access path" field (see "Task 2" above).

So if, for example, the current join order is [0 3 2 -1] and we have gone through all of the
decorations for Optimizable "2", that Optimizable's best access path will contain the conglomerate,
the join strategy, and the estimated cost for the decoration that returned the lowest cost
estimate.  What we want to do, then, is add that cost to !OptimizerImpl's running total. 
Once all Optimizables have been placed in the join order, the running total will represent
the total estimated cost for the complete join order.  That total is then used to determine
what the overall best join order for the !OptimizerImpl is (as described in "Task 4" below).

The code for adding an Optimizable's best cost estimate to the running total is expectedly
straightforward, as seen here:

{{{
/*
** Add the cost of the current optimizable to the total cost.
** The total cost is the sum of all the costs, but the total
** number of rows is the number of rows returned by the innermost
** optimizable.
*/
currentCost.setCost(
    currentCost.getEstimatedCost() + ce.getEstimatedCost(),
    ce.rowCount(),
    ce.singleScanRowCount());

if (curOpt.considerSortAvoidancePath() &&
    requiredRowOrdering != null)
{
    /* Add the cost for the sort avoidance path, if there is one */
    ce = curOpt.getBestSortAvoidancePath().getCostEstimate();

    currentSortAvoidanceCost.setCost(
        currentSortAvoidanceCost.getEstimatedCost() +
            ce.getEstimatedCost(),
        ce.rowCount(),
        ce.singleScanRowCount());
}
}}}

Note that !OptimizerImpl actually keeps track of two different costs: one for the "normal"
cost estimate and one for the "sort avoidance" cost.  Since we do not, at this point in the
code, know yet if we are going to use a sort avoidance plan, we have to make sure we keep
the running total for *both* of these costs current.

== Task 4: Remembering a Best Join Order ==

If the most-recently-placed Optimizable is placed at the last position in !OptimizerImpl's
join order AND if there are no more access paths for that Optimizable, then getNextDecoratedPermutation()
has the task of checking to see if the cost of the '''complete''' join order is the lowest
so far.  If so, the method makes the required call to "remember" that join order as the best
one for the !OptimizerImpl:

{{{
/* Do we have a complete join order? */
if ( joinPosition == (numOptimizables - 1) )
{
    ...
    /*
    ** Is the cost of this join order lower than the best one we've
    ** found so far?
    */
    if ((! foundABestPlan) ||
        (currentCost.compare(bestCost) < 0) ||
        bestCost.isUninitialized())
    {
        rememberBestCost(currentCost, Optimizer.NORMAL_PLAN);

        // Since we just remembered all of the best plans,
        // no need to reload them when pulling Optimizables
        // from this join order.
        reloadBestPlan = false;
    }
    else
        reloadBestPlan = true;
    ...
}
}}}

The "rememberBestCost()" method then does all of the following:

  * Saves the current cost as the "bestCost" for the !OptimizerImpl
  * Saves the current join order as the "bestJoinOrder" for the !OptimizerImpl
  * Iterates through all Optimizables in the list and tells each one to save its "best access
path" as its "''truly'' the best access path".  When we say "truly the best access path" here
we are referring to an Optimizable's best access path as it exists for the current '''round'''
of optimization (see JoinOrderPermutations for the definition of a "round").  This is different
from the "best access path" mentioned in the task descriptions above, where we are talking
about the best path for the current '''permutation'''.

Note that when cost-based optimization is complete and we move on to the "modification of
access paths" phase as described in the [http://wiki.apache.org/db-derby/LanguageOptimize
Optimizer Overview], the "''truly'' best access path" is the one on which modification of
the query tree is based.

One final thing worth noting is that the above code fragment sets a boolean variable, {{{reloadBestPlan}}},
as part of the "remembering" work.  This flag is used in the {{{OptimizerImpl.getNextPermutation()}}}
method when pulling an Optimizable out of the join order.  See the "Removing an Optimizable"
section on [http://wiki.apache.org/db-derby/JoinOrderPermutations this page] for more on why
plan "reload" is necessary.

== Returning "false" ==

When a call to getNextDecoratedPermutation() completes the third (and potentially the fourth)
task as described above, we know that we have exhausted all "decorations" for the most-recently-placed
Optimizable in the !OptimizerImpl's join order.  At that point the method returns "false",
thereby breaking the inner {{{while}}} loop shown at the top of this page and bringing us
back to {{{optimizer.getNextPermutation()}}}, which is described in detail [http://wiki.apache.org/db-derby/JoinOrderPermutations
here].

Mime
View raw message