Mailing-List: contact derby-dev-help@db.apache.org; run by ezmlm
Precedence: bulk
Reply-To: <derby-dev@db.apache.org>
Message-ID: <1786682183.1208195705066.JavaMail.jira@brutus>
Date: Mon, 14 Apr 2008 10:55:05 -0700 (PDT)
From: "A B (JIRA)" <jira@apache.org>
To: derby-dev@db.apache.org
Subject: [jira] Commented: (DERBY-3603) 'IN' clause ignores valid results,
 incorrect qualifier handling suspected
In-Reply-To: <1214723879.1207620925545.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/DERBY-3603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12588661#action_12588661 ] 

A B commented on DERBY-3603:
----------------------------

Thank you very much for picking this one up, Bryan.  I greatly appreciate it.

On the subqueryFlattening.diff:

To see why the diff occurs, I stepped through the following exercise.  I did not actually trace through the code, but I walked through what I *think* should be happening and it seems to have explained the diff...

At the time of the statements in question the data in the DOCS and COLLS tables is as follows:

DOCS: ID (VARCHAR) =>
  ('124'), ('24'), ('25'), ('27'), ('36'), ('567')

COLLS: ID (VARCHAR), COLLID (SMALLINT) =>
  ('123', 2), ('124', -5), ('24', 1), ('26', -2), ('36', 1), ('37', 8)

Now assume the query in question is:

  SELECT count(ID) FROM DOCS
    WHERE ID <> ANY (SELECT ID FROM COLLS WHERE COLLID IN (-2,1))

The ANY will be flattened giving us something to the effect of:

  SELECT count(ID) FROM DOCS, COLLS
    WHERE DOCS.ID <> COLLS.ID AND COLLS.COLLID IN (-2, 1)

The IN list for the queries that we're talking about is (-2, 1). So per the explanation I posted on Friday, we'd end up with something like the following withOUT the change.

Note: Lines preceded by "+" indicate that we read a row from the inner table, COLLS.  Lines preced by "*' indicate that what we do for the line changes with the proposed patch for this issue.  Lines preceded by an "x" indicate that we "filter" a row from COLLS, meaning we read it but then we discard it.

  - Get DOCS row ('124').
  - Get first IN list value, -2.
  + Get row from COLLS where COLLID is -2, i.e. ('26', -2).
  - See if ID column of ('26', -2) is NOT EQUAL to '124'.  They
    are not equal so count the '124' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('24').
  * Get next IN list value, 1.
  + Get row from COLLS where COLLID is 1, i.e. ('24', 1).
  x See if ID column of ('24', 1) is NOT EQUAL to '24'.  They
    are EQUAL so we skip the row ('24', 1) and continuing our
    scan of COLLS.
  + Get next row from COLLS where COLLID is 1, i.e. ('36', 1).
  - See if ID column of ('36', 1) is NOT EQUAL to '24'.  They
    are not equal so count the '24' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('25').
  - Implicitly RESET the probe state, then get the "next" IN
    list value, which will be -2.
  + Get row from COLLS where COLLID is -2, i.e. ('26', -2).
  - See if ID column of ('26', -2) is NOT EQUAL to '25'.  They
    are not equal so count the '25' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('27').
  * Get next IN list value, 1.
  + Get row from COLLS where COLLID is 1, i.e. ('24', 1).
  - See if ID column of ('24', 1) is NOT EQUAL to '27'.  They
    are not equal so count the '27' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('36').
  - Implicitly RESET the probe state, then get the "next" IN
    list value, which will be -2.
  + Get row from COLLS where COLLID is -2, i.e. ('26', -2).
  - See if ID column of ('26', -2) is NOT EQUAL to '36'.  They
    are not equal so count the '36' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('567').
  * Get next IN list value, 1.
  + Get row from COLLS where COLLID is 1, i.e. ('24', 1).
  - See if ID column of ('24', 1) is NOT EQUAL to '567'.  They
    are not equal so count the '567' row from DOCS and quit the
    scan of COLLS.
  - No more rows from DOCS, so done.

We can see from this that we get a row from COLLS a total of 7 times. In the process we "filtered" one row out, i.e. the row ('24', 1), because that row has the same ID has a row in DOCS.  Or put in terms of the scan statistics, we have 7 "rows seen" and one (1) row filtered.  And the query returns a count total of 6 rows from DOCS.

But now if we do the same exercise with the patch applied, we see the following:

  - Get DOCS row ('124').
  - Get first IN list value, -2.
  + Get row from COLLS where COLLID is -2, i.e. ('26', -2).
  - See if ID column of ('26', -2) is NOT EQUAL to '124'.  They
    are not equal so count the '124' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('24').
  * RESET probe state, then get "next" IN list value, which
    is -2 (again).
  + Get row from COLLS where COLLID is -2, i.e. ('26', 1).
  - See if ID column of ('26', -2) is NOT EQUAL to '24'.  They
    are not equal so count the '24' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('25').
  * RESET probe state, then get "next" IN list value, which
    is -2 (again).
  + Get row from COLLS where COLLID is -2, i.e. ('26', 1).
  - See if ID column of ('26', -2) is NOT EQUAL to '25'.  They
    are not equal so count the '25' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('27').
  * RESET probe state, then get "next" IN list value, which
    is -2 (again).
  + Get row from COLLS where COLLID is -2, i.e. ('26', 1).
  - See if ID column of ('26', -2) is NOT EQUAL to '27'.  They
    are not equal so count the '27' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('36').
  * RESET probe state, then get "next" IN list value, which
    is -2 (again).
  + Get row from COLLS where COLLID is -2, i.e. ('26', 1).
  - See if ID column of ('26', -2) is NOT EQUAL to '36'.  They
    are not equal so count the '36' row from DOCS and quit the
    scan of COLLS.
  - Get DOCS row ('567').
  * RESET probe state, then get "next" IN list value, which
    is -2 (again).
  + Get row from COLLS where COLLID is -2, i.e. ('26', 1).
  - See if ID column of ('26', -2) is NOT EQUAL to '567'.  They
    are not equal so count the '567' row from DOCS and quit the
    scan of COLLS.
  - No more rows from DOCS, so done.

Now the total number of rows we "see" from COLLS is 6 instead of 7.  This is because we never fetched the row ('24', 1) from the table.  Instead, we only had to fetch the one row, ('26', -2), over and over.  Then, since we never fetched row ('24', 1) from the table we didn't have to filter it out.  So we end up with a total of 6 "rows seen" and 0 rows filtered. Meanwhile the query still returns a count total of 6 rows from DOCS.

I didn't run through this exercise for the other cases, but I imagine (hope!) things would work out in a similar fashion...Does that seem like a reasonable explanation to you?  If so, then I think it's okay to update the master file for subqueryFlattening.sql for this issue.

Oh, and thanks for writing the standalone repro, as well!

> 'IN' clause ignores valid results, incorrect qualifier handling suspected
> -------------------------------------------------------------------------
>
>                 Key: DERBY-3603
>                 URL: https://issues.apache.org/jira/browse/DERBY-3603
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.3.2.1, 10.4.1.1
>            Reporter: David Butterworth
>            Assignee: Bryan Pendleton
>         Attachments: d3603_v1.patch, d3603_with_test.diff, derbydb.jar, derbydb.tar.bz2, subqueryFlattening.diff
>
>
> Derbys' 'IN' clause is returning different results depending on which side of a joined table
> I am doing my 'IN' comparison against. This only occurs when the number of items within the 'IN' clause is greater then 1.
> This behaviour was also confirmed by Bryan Pendleton in this thread:
> http://mail-archives.apache.org/mod_mbox/db-derby-user/200804.mbox/%3c47FA5974.2060705@amberpoint.com%3e
> Using the test database attatched the following 2 queries produce the issue:
> ij>  select count(*) FROM spike.accounts account, spike.admin_units admin_unit,
>     spike.bookings booking
>     WHERE booking.child_id = 2 AND
>     admin_unit.admin_unit_id IN (1,21) AND
>     booking.booking_date_time_out >= 20080331000000 AND
>     booking.booking_date_time_in <= 20080406235900 AND
>     account.account_id = booking.account_id AND
>     admin_unit.admin_unit_id = account.admin_unit_id;
> 1          
> -----------
> 2          
> 1 row selected
> ij> select count(*) FROM spike.accounts account, spike.admin_units admin_unit,
>     spike.bookings booking
>     WHERE booking.child_id = 2 AND
>     account.admin_unit_id IN (1,21) AND
>     booking.booking_date_time_out >= 20080331000000 AND
>     booking.booking_date_time_in <= 20080406235900 AND
>     account.account_id = booking.account_id AND
>     admin_unit.admin_unit_id = account.admin_unit_id;
> 1          
> -----------
> 3          
> 1 row selected
> ij> 
> The only difference between the 2 statements is which side of a join the 'IN' clause is matched against.
> Bryan performed some initial testing and stated the following:
> --------------------- SNIP ------------------------
> Interestingly, although the actual results do NOT contain any values
> for admin_unit_id = 21, if I change the query to:
>     admin_unit.admin_unit_id IN (1)
> or
>     account.admin_unit_id IN (1)
> then the problem disappears -- I get 3 rows for both queries.
> I also ran query plans for both the queries (in the IN (1,21) case)
> and have pasted the (simplified) query plans at the end of this message.
> I notice that in the case where the query gives 2 rows, which is
> when we specify admin_unit.admin_unit_id in (1,21), the admin_unit_id
> index scan output in the query plan contains:
>            qualifiers:
> Column[0][0] Id: 0
> Operator: =
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
> However, in the case where the query gives 3 rows, which is
> when we specify account.admin_unit_id in (1,21), the admin_unit_id
> index scan output in the query plan contains:
>            qualifiers:
> None
> I think it is the presence/absence of this qualifier on the query
> scan which is causing the different results in the query, as in
> the first case we see:
>            Number of rows qualified=2
>            Number of rows visited=3
> but in the second case we see:
>            Number of rows qualified=3
>            Number of rows visited=3
> I definitely don't have any explanation for why you are getting
> this odd behavior; it certainly seems like a bug to me.
> -------------END SNIP -----------------------

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.