db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mamta Satoor <msat...@gmail.com>
Subject Re: [jira] Created: (DERBY-251) DISTINCT query is returning duplicate rows
Date Thu, 05 May 2005 20:43:58 GMT
Hi,

I have a patch for this optimizer bug. Basically, the issue turned out
to be the logic for DISTINCT elimination. During the optimization
phase, if a query has DISTINCT clause, then impl.sql.compile.FromList
class's returnsAtMostSingleRow() method gets called. This method
returns true if the method concludes that DISTINCT in the query is
redundant (based on a complex logic that decides that the query is
going to return distinct rows on its own without the DISTINCT clause.
The details of the current logic for DISTINCT elimination can be found
in the comments at the method level.)

For the query in question in this bug, the method returned true for
DISTINCT elimination which is wrong. The explanation is as follows.

First of all, I was able to simplify the query reported in the bug to
following query.
select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1" <> ALL
(select  q3."NO1" from IDEPT q3 where  (q3."REPORTTO_NO" =  q2."NO1")))

This query gets converted to following during optimization
select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  not exists (
(select  q3."NO1" from IDEPT q3 where
(  q3."REPORTTO_NO" =  q2."NO1"  and q3."NO1" = q1."NO1") ) )  ;

This optimized query has 4 predicates associated with it
q3.reportto_no = q2.no1
q2.discrim_dept = 'HardwareDept'
q1.descrim_dept = 'SoftwareDept'
q1.no1 = q3.no1

Next, on this optimized query(since it has DISTINCT clause in it), the
returnsAtMostSingleRow() method gets called. The method incorrectly
returns true indicating that DISTINCT can be eliminated. The reason
for this is that method is looking at predicates that belong to the
inside query with the exists clause (which is on table IDEPT q3) to
determine DISTINCT elimination for the outer level.

The fix is that the predicates from the exists query, (in this
particular case, q3."NO1" = q1."NO1" and q3.reportto_no = q2.no1)
should not be considered when deciding elimination of DISTINCT in the
outer query. That is what the attached patch does.

Hope this helps understand the problem and the proposed fix for it.
The files impacted by the change are as follows
svn stat
M java\engine\org\apache\derby\impl\sql\compile\FromList.java
M java\testing\org\apache\derbyTesting\functionTests\tests\lang\distinctElimination.sql
M java\testing\org\apache\derbyTesting\functionTests\master\distinctElimination.out

Please send in comments you may have. I have run the existing tests
and the patch didn't cause any failures.

thanks,
Mamta


On 4/29/05, Mamta A. Satoor (JIRA) <derby-dev@db.apache.org> wrote:
> DISTINCT query is returning duplicate rows
> ------------------------------------------
> 
>         Key: DERBY-251
>         URL: http://issues.apache.org/jira/browse/DERBY-251
>     Project: Derby
>        Type: Bug
>  Components: SQL
>    Versions: 10.1.0.0
>    Reporter: Mamta A. Satoor
> Assigned to: Mamta A. Satoor
> 
> Following query on a table with primary key returns duplicate rows
> select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
> where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
> and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1"
> <> ALL  ( select  q3."NO1" from IDEPT q3 where  (  q3."REPORTTO_NO" =  q2."NO1")
) )  ;
> 
> The sql script to create the table and load data into it is as follows
> CREATE TABLE "APP"."IDEPT" ("DISCRIM_DEPT" VARCHAR(32), "NO1" INTEGER NOT NULL,
> "NAME" VARCHAR(50), "AUDITOR_NO" INTEGER, "REPORTTO_NO" INTEGER, "HARDWAREASSET"
> VARCHAR(15), "SOFTWAREASSET" VARCHAR(15));
> -- primary/unique
> ALTER TABLE "APP"."IDEPT" ADD CONSTRAINT "PK_IDEPT" PRIMARY KEY ("NO1");
> insert into idept values ('Dept', 1, 'Department1', null, null, null, null);
> insert into idept values ('HardwareDept', 2, 'Department2', 25, 1, 'hardwareaset2', null);
> insert into idept values ('HardwareDept', 3, 'Department3', 25, 2, 'hardwareaset3', null);
> insert into idept values ('SoftwareDept', 4, 'Department4', 25, 1, null, 'softwareasset4');
> insert into idept values ('SoftwareDept', 5, 'Department5', 30, 4, null, 'softwareasset5');
> 
> The problem appears to be in org.apache.derby.impl.sql.compile.FromList.returnsAtMostSingleRow()
method. This method along with other things tries to determine if the DISTINCT can be dropped
without causing duplicate rows. For the query in question, this method decides that DISTINCT
is not necessary which I think is incorrect.
> 
> If the table above is created with no primary key, the DISTINCT query does not return
duplicate rows. That is because one of the several criterias for dropping DISTINCT is that
there should be a unique index on the columns in the where clause.
> 
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>   http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>   http://www.atlassian.com/software/jira
> 
>

Mime
View raw message