db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bryan Pendleton (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DERBY-5584) Select statement with subqueries with group by and count distinct statements returns wrong number of results
Date Sat, 28 Jan 2012 15:57:10 GMT

    [ https://issues.apache.org/jira/browse/DERBY-5584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195570#comment-13195570
] 

Bryan Pendleton commented on DERBY-5584:
----------------------------------------

Thanks for the review, Dag. I'm cautiously optimistic about the patch and will continue testing
it.

Regarding the distinct limitation, Derby has had a limitation that there be at most one DISTINCT
aggregate in a query for a long time, probably ever since it was written. See, for example,
this
link from the 10.2 docs: http://db.apache.org/derby/docs/10.2/ref/rrefsqlj32693.html

        Only one DISTINCT aggregate expression per SelectExpression is allowed. 
        For example, the following query is not valid:

            SELECT AVG (DISTINCT flying_time), SUM (DISTINCT miles) FROM Flights

I'm guessing here, but I suspect that the plan for implementing such queries would involve
optimizer work *above* the GroupedAggregateResultSet. In particular, the optimizer could
(perhaps) implement the above query by doing:

1) Fetch all of Flights, and pass it through a GroupedAggregateResultSet which (a) sorts
by flying_time, (b) eliminates duplicate values, and (c) computes the average of the
unique flying_time values
2)  Fetch all of Flights, and pass it through a GroupedAggregateResultSet which (a) sorts
by miles, (b) eliminates duplicate values, and (c) computes the sum of the unique miles values.
3) Combine the two scalar values into the final result

Trying to do *all* of this in a single pass through the Flights data is complex, because you
can't simultaneously sort the data on flying_time AND on miles, so you'd have to have some
other way of eliminating duplicates, other than sorting.

For example, we could have some sort of low-level HashMap collection to record the
unique values, and have a hybrid algorithm which sorted one one distinct aggregate and
kept hash collections of the other distinct aggregates (assuming all those hashes fit in memory).

I think that, in practice, those databases which implement multiple DISTINCT aggregates
do so by having higher-level query plans which take multiple passes over the data.

In all my time with Derby, I can't recall anyone ever complaining about the DISTINCT
aggregate limitation. Moreover, in my use of databases I haven't found occasion to regularly
use DISTINCT aggregates. For whatever that's worth.

Thanks again for the review and feedback; I'll continue testing and working on the patch
(I'm out of town for a week so it may be a little while).

                
> Select statement with subqueries with group by and count distinct statements returns
wrong number of results
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-5584
>                 URL: https://issues.apache.org/jira/browse/DERBY-5584
>             Project: Derby
>          Issue Type: Bug
>          Components: Network Server
>    Affects Versions: 10.7.1.1
>         Environment: Output from sysinfo
> java.specification.name: Java Platform API Specification
> java.specification.version: 1.6
> java.runtime.version: 1.6.0_20-b02
> --------- Derby Information --------
> JRE - JDBC: Java SE 6 - JDBC 4.0
> [/home/piotrz/Desktop/db-derby-10.7.1.1-bin/lib/derby.jar] 10.7.1.1 - (1040133)
> [/home/piotrz/Desktop/db-derby-10.7.1.1-bin/lib/derbytools.jar] 10.7.1.1 - (1040133)
> [/home/piotrz/Desktop/db-derby-10.7.1.1-bin/lib/derbynet.jar] 10.7.1.1 - (1040133)
> [/home/piotrz/Desktop/db-derby-10.7.1.1-bin/lib/derbyclient.jar] 10.7.1.1 - (1040133)
> ------------------------------------------------------
> ----------------- Locale Information -----------------
> Current Locale :  [English/United States [en_US]]
> Found support for locale: [cs]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [de_DE]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [es]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [fr]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [hu]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [it]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [ja_JP]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [ko_KR]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [pl]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [pt_BR]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [ru]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [zh_CN]
> 	 version: 10.7.1.1 - (1040133)
> Found support for locale: [zh_TW]
> 	 version: 10.7.1.1 - (1040133)
>            Reporter: Piotr Zgadzaj
>            Assignee: Bryan Pendleton
>         Attachments: patch1.txt, query.log, tests.out, tests.sql, try1.txt
>
>
> Steps to reproduce:
> 1. Create database, connect to database with any JDBC client
> 2. create two tables:
> CREATE TABLE TEST_5 (
>        profile_id INTEGER NOT NULL,
>        group_ref INTEGER NOT NULL,
>        matched_count INTEGER NOT NULL
>    );
>    CREATE TABLE TEST_6 (
>        profile_id INTEGER NOT NULL,
>        group_ref INTEGER NOT NULL,
>        matched_count INTEGER NOT NULL
>    );
> 3. Insert two records for each table:
> insert into test_5 values (1, 10000,1);
> insert into test_5 values (2, 10000, 2);
> insert into test_6 values (1, 10000,1);
> insert into test_6 values (2, 10000, 2);
> 4. Run following statement
> SELECT *
> FROM
>  (SELECT ps1.group_ref,
>    COUNT(DISTINCT ps1.matched_count) AS matched_count
>  FROM test_5 ps1
>  GROUP BY ps1.group_ref,
>    ps1.profile_id
>  ) a,
>  (SELECT ps2.group_ref,
>    COUNT( DISTINCT ps2.matched_count) AS matched_count
>  FROM test_6 ps2
>  GROUP BY ps2.group_ref,
>    ps2.profile_id
>  ) b
> As a result I've got 3 records instead of 4 - at least Oracle 10g
> returns 4 records for this statement. Maybe i'm doing something wrong.
> Do you have any suggestions / possible workarounds for this problem

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message