db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Knut Anders Hatlen <Knut.Hat...@Sun.COM>
Subject Re: What is the correct way to solve DERBY-504?
Date Thu, 18 Aug 2005 22:19:25 GMT
Rick Hillegas <Richard.Hillegas@Sun.COM> writes:

> Hi Knut,
>
> Hm, I'm not sure I understand what you've done. Pardon me if I have
> garbled your description.

Hi, thanks for answering, Rick! I'll try to clear up a few things.

> It sounds as though the optimizer has done two things:
>
> 1) Rewritten the original query by flattening the inner subquery into
> the outer SELECT.
>
> 2) Then tried to push the DISTINCT into the Store.

No, the optimizer didn't flatten the original query, it just pushed
the DISTINCT into the subquery, which then pushed it into the store.

> Perhaps the optimzer has made a flattening mistake here. It may have
> rewritten the query to be
>
> SELECT DISTINCT name, id from names
>
> when it meant to rewrite the query as
>
> SELECT DISTINCT name from names

The optimized parse tree looked like this (taken from memory):

CursorNode
 |
 +-> ProjectRestrictNode (top-level query)
      |
      +-> column: name
      |
      +-> ProjectRestrictedNode (subquery)
           |
           +-> column: name
           |
           +-> column: id
           |
           +-> FromBaseTable (names table)

So there wasn't any flattening, but the query was effectively
rewritten to

  SELECT name FROM (SELECT DISTINCT name, id FROM names)

> What I'm saying is, it't not clear to me that what's broken is
> DISTINCT push-down. The bug may really be in the subquery flattening
> logic.

It's not clear to me either. I'm not sure the optimizer even tries to
flatten such queries. I'll look more into that.

I think we only need to fix one of them. The criteria used by the
optimizer to decide whether it can push the DISTINCT, will only be met
by such simple projections and simple nested projections. So if we
flatten those queries (which should be fairly easy, I reckon), there
will never be a case where the optimizer thinks it can push the
DISTINCT to a subquery.

> I agree with your misgivings that your proposed solution is
> overbroad. I think the bug has to do with the agreement between the
> outer DISTINCT list and the inner SELECT list. They need to be
> identical. In this case the outer DISTINCT list has one column but the
> inner SELECT list has two. I think that the number of tables in the
> inner FROM list should not be relevant.

The optimizer, as it is now, only supports pushing the DISTINCT all
the way to the store (performing duplicate elimination while scanning
the table), which is why it only works with one table in the inner
FROM list. I agree that it should be possible to push the DISTINCT in
cases where there are more than one table too, but in those cases you
would just move a DistinctNode to another place in the query tree, not
completely eliminating it as when you push it to the store.

By the way, does anyone have other example queries where it would be
an optimization to push the DISTINCT to a subquery? I mean, I don't
see that it's any useful other than in these two cases:

 1) When the distinct could be pushed directly to the store (as in
    SELECT DISTINCT x,y,z FROM table), and

 2) SELECT DISTINCT on a subquery which is not joined with other
    tables (as in SELECT DISTINCT x,y,z FROM (SELECT x,y,z,....)).

1) is already supported, and 2) won't be necessary to optimize if we
implement proper flattening. In all the other cases I can come up
with, pushing is not possible. Please correct me if I'm wrong.

-- 
Knut Anders


Mime
View raw message