db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rick Hillegas <Richard.Hille...@Sun.COM>
Subject Re: What is the correct way to solve DERBY-504?
Date Fri, 19 Aug 2005 13:59:57 GMT
Hi Knut,

Thanks for the clarification. Sounds like we agree that the bug is in 
the query rewriting rather than in the pushdown to the Store. Methinks 
the DISTINCT should not be pushed into the subselect if the inner and 
outer SELECT lists aren't identical.

Here's another possible scenario where you might push a DISTINCT from an 
outer into an inner query. It all depends on when the optimizer decides 
to eliminate the DISTINCT altogether:

SELECT DISTINCT maxChildAge, employeeID FROM
(
    SELECT max( childAge ) maxChildAge, employeeID
    FROM children, employees
    WHERE children.parentID=employeeID
    GROUP BY employeeID
)

Here the optimizer might reasonably do something like this:

1) Push the DISTINCT into the subselect.

2) Then notice that the results of the original subselect are already 
unique because of the GROUP BY

3) So eliminate the DISTINCT altogether.

Here the subquery joins two tables. You can imagine other queries in 
which the subquery joins multiple tables and the optimizer does the 
following:

1) Pushes the DISTINCT into the subselect.

2) Notices that, based on the selectivity of the join columns, the 
subselect already returns unique results.

3) So eliminates the redundant DISTINCT.

Cheers,
-Rick

Knut Anders Hatlen wrote:

>Rick Hillegas <Richard.Hillegas@Sun.COM> writes:
>
>  
>
>>Hi Knut,
>>
>>Hm, I'm not sure I understand what you've done. Pardon me if I have
>>garbled your description.
>>    
>>
>
>Hi, thanks for answering, Rick! I'll try to clear up a few things.
>
>  
>
>>It sounds as though the optimizer has done two things:
>>
>>1) Rewritten the original query by flattening the inner subquery into
>>the outer SELECT.
>>
>>2) Then tried to push the DISTINCT into the Store.
>>    
>>
>
>No, the optimizer didn't flatten the original query, it just pushed
>the DISTINCT into the subquery, which then pushed it into the store.
>
>  
>
>>Perhaps the optimzer has made a flattening mistake here. It may have
>>rewritten the query to be
>>
>>SELECT DISTINCT name, id from names
>>
>>when it meant to rewrite the query as
>>
>>SELECT DISTINCT name from names
>>    
>>
>
>The optimized parse tree looked like this (taken from memory):
>
>CursorNode
> |
> +-> ProjectRestrictNode (top-level query)
>      |
>      +-> column: name
>      |
>      +-> ProjectRestrictedNode (subquery)
>           |
>           +-> column: name
>           |
>           +-> column: id
>           |
>           +-> FromBaseTable (names table)
>
>So there wasn't any flattening, but the query was effectively
>rewritten to
>
>  SELECT name FROM (SELECT DISTINCT name, id FROM names)
>
>  
>
>>What I'm saying is, it't not clear to me that what's broken is
>>DISTINCT push-down. The bug may really be in the subquery flattening
>>logic.
>>    
>>
>
>It's not clear to me either. I'm not sure the optimizer even tries to
>flatten such queries. I'll look more into that.
>
>I think we only need to fix one of them. The criteria used by the
>optimizer to decide whether it can push the DISTINCT, will only be met
>by such simple projections and simple nested projections. So if we
>flatten those queries (which should be fairly easy, I reckon), there
>will never be a case where the optimizer thinks it can push the
>DISTINCT to a subquery.
>
>  
>
>>I agree with your misgivings that your proposed solution is
>>overbroad. I think the bug has to do with the agreement between the
>>outer DISTINCT list and the inner SELECT list. They need to be
>>identical. In this case the outer DISTINCT list has one column but the
>>inner SELECT list has two. I think that the number of tables in the
>>inner FROM list should not be relevant.
>>    
>>
>
>The optimizer, as it is now, only supports pushing the DISTINCT all
>the way to the store (performing duplicate elimination while scanning
>the table), which is why it only works with one table in the inner
>FROM list. I agree that it should be possible to push the DISTINCT in
>cases where there are more than one table too, but in those cases you
>would just move a DistinctNode to another place in the query tree, not
>completely eliminating it as when you push it to the store.
>
>By the way, does anyone have other example queries where it would be
>an optimization to push the DISTINCT to a subquery? I mean, I don't
>see that it's any useful other than in these two cases:
>
> 1) When the distinct could be pushed directly to the store (as in
>    SELECT DISTINCT x,y,z FROM table), and
>
> 2) SELECT DISTINCT on a subquery which is not joined with other
>    tables (as in SELECT DISTINCT x,y,z FROM (SELECT x,y,z,....)).
>
>1) is already supported, and 2) won't be necessary to optimize if we
>implement proper flattening. In all the other cases I can come up
>with, pushing is not possible. Please correct me if I'm wrong.
>
>  
>


Mime
View raw message