db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "RPost" <rp0...@pacbell.net>
Subject Re: Optimizer hints?
Date Wed, 22 Dec 2004 00:57:29 GMT
Thanks Mike and Satheesh. Your comments are getting closer to what I had in
mind when I suggested that we document a specific need that we expect the
use of hints to fix. (By the way I am not opposed to introducing hints I
just think we ought to document a precise need).

An approach that I have seen groups have success with in the past is:

1. State the problem as clearly and concisely as possible.
2. Enumerate the key issues that illustrate the problem.
3. Gather some metrics that validate the problem and verify the extent of
the problem.
4. Propose a solution to the problem.
5. Implement and test the proposed solution.

In my comments I was essentially saying that we were proposing a solution
(step 3) before we had a clear problem statement and an illustration of at
least one problem case that might benefit from a solution.

I am assuming that the purpose of a 'hint' is to nudge Derby to select an
optimization choice that it might not otherwise select. If so, this suggests
that in order to provide a proper set of hints we must first know what
choices Derby has now, the process Derby uses to make a choice and an
example query that illustrates Derby making the wrong choice. If we can't do
that then there is no need for hints.

Can someone with Derby experience try to correct and expand my (feeble)
attempts to document steps 1, 2, 3, 4 and 5 above?

1. Problem - Derby does not always make the best choice when optimizing
    a. What is the list of choices that Derby currently has available? Hash
Join, Nested Loop Join, Table Join Order, Other? Are all choices available
for all query types (insert, update, delete, ddl, batch)?
    b. What process does Derby use currently to choose among the
alternatives available? Is this processs the same for all query types
(insert, etc)?
    c. Is there currently a way to document (e.g. explain plan, log file,
debug messages) what optimization choice Derby uses for a particular query.

2. Possible key issues
    a. Queries that must sort 1 or multiple tables, queries with large
numbers of tables, queries with one or more large (what is large?) tables,
queries on tables with no indexes on join columns.
    b. Does the type of result set (readonly, updateable, scrollable, etc)
affect the optimizer's choice?
    c. Does the amount of memory affect the choice.
    d. other issues

3. Gather some metrics
    a. This step will not only define the extent of the problem but will
point us to appropriate hints or fixes.

4. Proposed solution - add a means of providing hints to the optimizer.
    a. Will the hints be 'suggestions' (Index 'A' might help or join table B
first) or 'commands' (do a full table scan, use index 'A')? That is, how
much weight will a hint carry?
    b. Will the hints be provided for all query types (insert, etc)?
    c. Will the hints be used only for prepared statements or for all
    d. How many different types of hints might be used? join strategy, index
use, full table scan use, join order?

5. Implement and test a proposed solution
    a. create some test cases - this will also include gathering metrics
(step 3) for the test cases
    b. run the test cases with and without the hints
    c. validate that the solution produces measurable results.

I think that a very useful first step might be to introduce some way of
forcing Derby to use a particular optimization. How else will we be able to
get the metrics we need to know if Derby is making a good choice?

A good example is Gerald's problem. In his query it appears that Derby was
using, or trying to use, a HashJoin strategy and running out of memory. When
he forced Derby to use a NestedLoop join strategy the query not only ran but
ran very quickly.

Unfortunately Gerald had to modify code in order to get the metrics that
identified the problem and suggested a solution. If there were a way to
force Derby to use a HashJoin, and then to rerun the same query and force
Derby to use a NestedLoop join he could have gotten the metrics needed with
little effort.

Ironically, one method of forcing Derby to use a particular join strategy is
to introduce some form of 'hint' that is either mandatory or has sufficient
weight to achieve the desired effect. This could be achieved by adding a new
property or option that could be used during integration testing.

----- Original Message ----- 
From: "Mike Matrigali" <mikem_app@sbcglobal.net>
To: "Derby Development" <derby-dev@db.apache.org>
Sent: Tuesday, December 21, 2004 3:01 PM
Subject: Re: Optimizer hints?

> All the database systems I have worked on have had optimizer hints,
> while I agree it would be best if we didn't have to ever have them I
> think they are a good way to be able to provide a quick fix while we
> improve the optimizer to not need them.  I agree with all the posts that
> encourage people to provide as much detail as possible to their problems
> with query plans.
> I think it is a good idea to somehow introduce the hints such that it
> does not affect the standard SQL.  Seems like it might be nice to make
> it some sort of XML that might fit with an eventual XML output of query
> plans so that tools could read and write them.
> As a zero admin db derby should have the goal of making the hints not
> necessary, but in the real world we may not get there - so I see them as
> a necessary evil.  I also think it makes supporting the product easier
> as one can force a test case when maybe the customers data that caused
> that particular query plan shape is not available.
> The area of the optimizer may provide some interesting work for those
> interested out there.  I am not an expert in the optimizer but here are
> some areas I have thought may be interesting:
> o add a sort merge option to the optimizer choices
> o do a better job choosing when to invalidate a plan and cause another
> optimization try.  Currently we just use a row count metric and a number
> of iterations metric, and always when a ddl operation affects one of the
> tables.  Some options might be:
>     1) is it really necessary to recompile a single table query,
> non-join query other than at ddl?
>     2) can we use actual stats gathered during execution and compare
> them with estimates to see if we should change estimates and recompile?
>     3) Currently we recompile after a set number of iterations, should
> we bump the default number - or just get rid of it all together?
>     4) xml query plans
> o Update the costing information at least to current reality of the
> system (raw numbers have not been run for a few releases).  Maybe change
> units of costing so that other software products could provide costing.
> o add support in the optimizer to choose an index based on a function
> rather than a key (and add the function index support in language/store).
> Satheesh Bandaram wrote:
> > Right... I also remember seeing a large query (with 40+ table
> > references) where Derby optimizer was spending a lot of time figuring
> > out the right join-order. All the tables were really small in this case
> > and forcing a join-order saved optimization time and  improved total
> > query execution time significantly.
> >
> > The fact that most (if not all) database vendors have optimizer hints
> > should show its need ...
> >
> > Satheesh
> >
> > Mike Matrigali wrote:
> >
> >
> >>I don't have specific queries, but I can give examples where the
> >>optimizer may not have the right information to pick a good plan:
> >>
> >>1) queries involving tables with no indexes have no data distribution
> >>~   maintained by the system, so any guess the optimizer makes may be
> >>~   wrong for a dataset.
> >>2) queries involving indexes basically use the shape of the tree to
> >>~   estimate data distribution.  The cost information for an estimate
> >>~   of number of rows between key1 and key2, assumes uniform
> >>~   throughout the tree.
> >>3) because of #2, multi-column indexes (key1, key2) don't provide data
> >>~   distribution information on key2 - so again the optimizer just uses
> >>~   a default distribution guess.
> >>
> >>RPost wrote:
> >>| Do you have any specific examples of queries that could benefit from
> >>| optimizer hint?
> >>|
> >>| Does anyone know of any documented instances where Derby has selected
> >>| execution plan that is ineffective or suboptimal? Were these instances
> >>able
> >>| to be corrected by tuning the system using existing
> >>| properties?
> >>|
> >>| Oracle allows hints to be provided as SQL comments.
> >>|
> >>| Re providing an offline tool. Does this method imply that the
> >>| exist independent of any given running database instance. If so are
> >>| proposing that a prepared statement be saved and made available for
> >>future
> >>| execution without preparing it again?
> >>|
> >>| ----- Original Message -----
> >>| From: "Satheesh Bandaram" <satheesh@Sourcery.Org>
> >>| To: "Derby Development" <derby-dev@db.apache.org>
> >>| Sent: Friday, December 17, 2004 2:41 PM
> >>| Subject: Optimizer hints?
> >>|
> >>|
> >>|
> >>| I have been thinking about adding optimizer hints to Derby. Though
> >>| optimizer does perform a reasonable job, it may be useful to have
> >>| optimizer hints for cases 1) when updated index statistics is not
> >>| available, or even incorrect 2) Rare cases, like this one?, when the
> >>| optimizer doesn't do the job right 3) when applications issue queries
> >>| not designed for Derby (like very large number of tables).
> >>|
> >>| Derby optimizer is primarily tasked to handle 1) access method
> >>| to use table scan or index) 2) join order and 3) join strategy (nested
> >>| loop or HashJoin) A complete optimizer hints (should they actually be
> >>| called optimizer overrides?) should be able to provide means of
> >>| specifying all these. There are several ways to do this, including
> >>|
> >>|    1. Enhance the SQL to recognize additional keywords, like
> >>| These properties could specify optimizer hints. That is what
> >>| had before, but this causes non-portable SQL to be written. Attempts
> >>| run these statements against any other SQL server could cause syntax
> >>| errors. Not an ideal solution, according to me.
> >>|    2. Provide optimizer hints as SQL comments. These comments are
> >>| recognized only by Derby parser and other SQL engines would simply
> >>| ignore these. There are some limitations in Derby parser currently to
> >>| implement this option.
> >>|    3. Provide an offline tool which registers hints for SQL
> >>| When those SQL statements are executed, Derby could look up previously
> >>| registered hints for those statements and automatically apply them.
> >>| Advantage of this scheme is that applications don't need to be
> >>| to add hints and they can easily be removed when not needed. How
> >>| to register the hints could be worked out. I personally prefer this
> >>| approach.
> >>|
> >>| Any comments?
> >>|
> >>| Satheesh
> >>|
> >>| Gerald Khin (JIRA) wrote:
> >>|
> >>|
> >>|>HashJoinStrategy leads to java.lang.OutOfMemoryError
> >>|>----------------------------------------------------
> >>|
> >>|>        Key: DERBY-106
> >>|>        URL: http://nagoya.apache.org/jira/browse/DERBY-106
> >>|>    Project: Derby
> >>|>       Type: Bug
> >>|>   Reporter: Gerald Khin
> >>|
> >>|
> >>|>My application is running out of memory: I encounterd a
> >>|
> >>| java.lang.OutOfMemoryError. I used -Xmx256M. Unfortunatley, I cannot
> >>| spend an arbitrary amount of JVM memory.
> >>|
> >>|>Then, I commented out the line in class OptimizerFactoryImpl which was
> >>|
> >>| adding the HashJoinStrategy to the set of Join strategies:
> >>|
> >>|>       if (joinStrategySet == null)
> >>|>       {
> >>|>//            JoinStrategy[] jss = new JoinStrategy[2];
> >>|>           JoinStrategy[] jss = new JoinStrategy[1];
> >>|>           jss[0] = new NestedLoopJoinStrategy();
> >>|>//            jss[1] = new HashJoinStrategy();
> >>|>           joinStrategySet = jss;
> >>|>       }
> >>|
> >>|>And with these changes the OutOfMemoryError has gone away! And it
> >>|
> >>| even with -Xmx128M!!!
> >>|
> >>|>So I guess that there is a major memory issue with this HashJoin
> >>|
> >>| strategy implementation.
> >>|
> >>|>If it turns out to be too complicated to make the memory consumption
> >>|
> >>| more predicatble or even bounded to some configurable limit, then I
> >>| at least as a workaround a way to turn off the HashJoin strategy
> >>| completely: I did it by patching and building my own derby.jar, but if
> >>| there would be an official solution with some kind of switch like a
> >>| system property, it would be great!
> >>|
> >>|
> >>|
> >
> >
> >

View raw message