Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Derby Development" <derby-dev@db.apache.org>
Received-SPF: pass (asf.osuosl.org: local policy)
Date: Thu, 18 Aug 2005 10:25:42 -0700
From: Rick Hillegas <Richard.Hillegas@Sun.COM>
Subject: Re: What is the correct way to solve DERBY-504?
In-reply-to: <wjo8d5ob5htn.fsf@sun.com>
To: Derby Development <derby-dev@db.apache.org>
Message-id: <4304C496.7000202@sun.com>
MIME-version: 1.0
Content-type: text/plain; charset=ISO-8859-1; format=flowed
Content-transfer-encoding: 7BIT
User-Agent: Mozilla Thunderbird 1.0.6 (Windows/20050716)
References: <wjo8d5ob5htn.fsf@sun.com>

Hi Knut,

Hm, I'm not sure I understand what you've done. Pardon me if I have 
garbled your description.

It sounds as though the optimizer has done two things:

1) Rewritten the original query by flattening the inner subquery into 
the outer SELECT.

2) Then tried to push the DISTINCT into the Store.

Perhaps the optimzer has made a flattening mistake here. It may have 
rewritten the query to be

SELECT DISTINCT name, id from names

when it meant to rewrite the query as

SELECT DISTINCT name from names

What I'm saying is, it't not clear to me that what's broken is DISTINCT 
push-down. The bug may really be in the subquery flattening logic.

I agree with your misgivings that your proposed solution is overbroad. I 
think the bug has to do with the agreement between the outer DISTINCT 
list and the inner SELECT list. They need to be identical. In this case 
the outer DISTINCT list has one column but the inner SELECT list has 
two. I think that the number of tables in the inner FROM list should not 
be relevant.

Regards,
-Rick


Knut Anders Hatlen wrote:

>Hi,
>
>I have found a solution to DERBY-504
>(http://issues.apache.org/jira/browse/DERBY-504), but I would like to
>get some input, as I'm not sure the solution is quite optimal.
>
>The short version of the bug is that in queries like
>
>  SELECT DISTINCT name FROM (SELECT name, id FROM names) AS n
>
>duplicates are not removed from the result. (I don't think anyone
>would/should ever write such a query, but it's still a bug...)
>
>After some investigation I found out that the bug is caused by the
>optimizer passing the distinct from the top level query down to the
>subquery, thereby scanning the names table for DISTINCT(name, id)
>instead of DISTINCT(name).
>
>This particular optimization is meant for cases where the duplicate
>elimination could be done during the scanning of the table, and a
>comment in org/apache/derby/impl/sql/compile/SelectNode.java mentions
>the criteria for using this optimization:
>
>  /* See if we can push duplicate elimination into the store
>   * via a hash scan.  This is possible iff:
>   *        o  A single table query
>   *        o  We haven't merged the order by and distinct sorts.
>   *           (Results do not have to be in a particular order.)
>   *        o  All entries in the select's RCL are ColumnReferences.
>   *        o  No predicates (This is because we currently do not
>   *           differentiate between columns referenced in the select
>   *           list and columns referenced in other clauses.  In other
>   *           words, the store will do duplicate elimination based on
>   *           all referenced columns.)
>
>For the single-table-query criterion, the code checks whether the
>query has only one source in its FROM clause, and then uses the same
>test recursively on that source until the criterion is not met or the
>source is a table.
>
>My bug fix works by disabling the recursive check. Instead of running
>the test recursively, it just checks whether there is exactly one
>source in the FROM clause and that the source is a table, not a
>subquery.
>
>Now, this fix works, which is probably most important, but I'm not
>sure it is the best solution. At least, I still have some questions.
>
> 1. The fix removes optimization in some cases where it could be
>    advantageous. It makes sure that the top-level distinct is never
>    pushed down to a subquery. Is this an acceptable trade-off? An
>    alternative to this would be to make the push more
>    intelligent. Right now the distinct scan is pushed to the subquery
>    through ResultSetNode.markForDistinctScan(), a method which takes
>    no arguments. Perhaps markForDistinctScan() could have the list of
>    distinct columns as an argument?
>
> 2. The query that exposed this bug is really a distinct select on a
>    projection of a projection. A projection of a projection could
>    easily be rewritten as a single projection. Should we extend the
>    optimizer to handle such cases?
>
>I would really like to hear your comments on these questions!
>
>Thanks!
>
>  
>