db-derby-dev mailing list archives

From Knut Anders Hatlen <Knut.Hat...@Sun.COM>
Subject What is the correct way to solve DERBY-504?
Date Thu, 18 Aug 2005 16:29:56 GMT

I have found a solution to DERBY-504
(http://issues.apache.org/jira/browse/DERBY-504), but I would like to
get some input, as I'm not sure the solution is quite optimal.

The short version of the bug is that in queries like

  SELECT DISTINCT name FROM (SELECT name, id FROM names) AS n

duplicates are not removed from the result. (I don't think anyone
would/should ever write such a query, but it's still a bug...)

After some investigation I found that the bug is caused by the
optimizer pushing the DISTINCT from the top-level query down into the
subquery, so that the names table is scanned for DISTINCT (name, id)
instead of DISTINCT (name).
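To see why deduplicating on both columns gives the wrong answer, here is a minimal standalone sketch (plain Java, not Derby code; the table contents and class names are invented for illustration):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DistinctPushdownDemo {
    // A row of the hypothetical "names" table.
    record Row(String name, int id) {}

    // Correct semantics: DISTINCT applies only to the projected column (name).
    static List<String> distinctOnName(List<Row> rows) {
        return rows.stream().map(Row::name).distinct().collect(Collectors.toList());
    }

    // The buggy behaviour: the distinct is pushed into the scan of the
    // subquery, so duplicate elimination is done on (name, id) together.
    static List<String> distinctOnNameAndId(List<Row> rows) {
        Set<Row> seen = new LinkedHashSet<>(rows); // dedup on the full (name, id) pair
        return seen.stream().map(Row::name).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("alice", 1), new Row("alice", 2), new Row("bob", 3));
        System.out.println(distinctOnName(rows));      // [alice, bob]
        System.out.println(distinctOnNameAndId(rows)); // [alice, alice, bob]
    }
}
```

Because "alice" appears with two different ids, deduplicating on the pair keeps both rows, and the duplicate name survives.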

This particular optimization is meant for cases where the duplicate
elimination could be done during the scanning of the table, and a
comment in org/apache/derby/impl/sql/compile/SelectNode.java mentions
the criteria for using this optimization:

  /* See if we can push duplicate elimination into the store
   * via a hash scan.  This is possible iff:
   *        o  A single table query
   *        o  We haven't merged the order by and distinct sorts.
   *           (Results do not have to be in a particular order.)
   *        o  All entries in the select's RCL are ColumnReferences.
   *        o  No predicates (This is because we currently do not
   *           differentiate between columns referenced in the select
   *           list and columns referenced in other clauses.  In other
   *           words, the store will do duplicate elimination based on
   *           all referenced columns.)
   */

For the single-table-query criterion, the code checks whether the
query has only one source in its FROM clause, and then uses the same
test recursively on that source until the criterion is not met or the
source is a table.

My bug fix works by disabling the recursive check. Instead of running
the test recursively, it just checks whether there is exactly one
source in the FROM clause and that the source is a base table, not a
subquery.
Now, this fix works, which is probably most important, but I'm not
sure it is the best solution. At least, I still have some questions.

 1. The fix removes optimization in some cases where it could be
    advantageous. It makes sure that the top-level distinct is never
    pushed down to a subquery. Is this an acceptable trade-off? An
    alternative to this would be to make the push more
    intelligent. Right now the distinct scan is pushed to the subquery
    through ResultSetNode.markForDistinctScan(), a method which takes
    no arguments. Perhaps markForDistinctScan() could have the list of
    distinct columns as an argument?

 2. The query that exposed this bug is really a distinct select on a
    projection of a projection. A projection of a projection could
    easily be rewritten as a single projection. Should we extend the
    optimizer to handle such cases?
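The idea in question 1 could be sketched as follows. This is a hypothetical helper, not Derby's API; the method and parameter names are invented. The point is that if the subquery knows which columns the outer DISTINCT applies to, it can refuse the push when the scanned column set differs:

```java
import java.util.Set;

public class DistinctScanCheck {
    // Hypothetical version of the decision behind markForDistinctScan(),
    // taking the distinct columns as an argument (invented names, not
    // Derby's API).
    static boolean canMarkForDistinctScan(Set<String> distinctColumns,
                                          Set<String> scannedColumns) {
        // Pushing the distinct into the scan deduplicates on ALL scanned
        // columns, so it is only safe when those are exactly the columns
        // the outer DISTINCT applies to.
        return scannedColumns.equals(distinctColumns);
    }

    public static void main(String[] args) {
        // The DERBY-504 query: DISTINCT name over a subquery scanning (name, id).
        System.out.println(canMarkForDistinctScan(
            Set.of("name"), Set.of("name", "id"))); // false
        // Column sets match, so the push would be safe.
        System.out.println(canMarkForDistinctScan(
            Set.of("name"), Set.of("name")));       // true
    }
}
```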

I would really like to hear your comments on these questions!


Knut Anders
