db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "A B (JIRA)" <j...@apache.org>
Subject [jira] Commented: (DERBY-2256) Wrong Results: Use of decimal values in an IN-list with INTEGER left operand can lead to extra rows.
Date Wed, 21 Mar 2007 22:34:32 GMT

    [ https://issues.apache.org/jira/browse/DERBY-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482938
] 

A B commented on DERBY-2256:
----------------------------

It looks like the various incorrect results reported in this Jira are all caused by one of
two different but related problems.

Problem I) Compile-time comparisons.

The first problem exists in the preprocessing code of InListOperatorNode, where an IN list
comprised exclusively of constant literals undergoes three potential transformations:

  0. First, if the list has a single value in it, it will be replaced by
     an equality predicate, and that's it.  So an IN list such as
     "... i in (3)" will become "i = 3".
  1. Else, the list will be sorted in ascending order.
  2. Then, in the special case where the minimum and maximum values are
     the same, the IN list will be replaced with a simple equality
     predicate.  So an IN list such as " ... i in (2, 2, 2)" will actually
     turn into an equality of the form "i = 2".

The problem here is that the code to sort the list, and also the code to find the min/max
values, always does comparisons using the type of the left operand, as can be seen from the
following comment in the code:

    /* When sorting or choosing min/max in the list, if types
     * are not an exact match, we use the left operand's type
     * as the "judge", assuming that they are compatible, as
     * also the case with DB2.
     */

But it turns out that this is *not* always the correct thing to do--and despite what the comment
says, this does not appear to be what DB2 does, either (all of the queries referenced in this
Jira issue return the correct results on DB2 v8).

To see why this is wrong, let's assume we have a table T1 with an integer column "I" whose
rows are simply:

    I
    ----
    1
    2
    3
    4
    5

Now let's take the following query:

  select * from t1 where i in (4.23, 4);

In this case "i" is the left operand, so the left operand's type is INTEGER.  Derby will then
bubble-sort the IN list values using INTEGER as the "judge" type for comparisons.  When doing
integer comparisons with decimal values, Derby will compare with a _truncated_ version of
the decimal(s), which means that we will effectively end up with something like:

  if (trunc(4.23) > 4)
    <swap "4.23" with "4">

which becomes:

  if (4 > 4)
    <swap "4.23" with "4>

Since 4 is not greater than 4, the "if" condition will return false and thus we will not swap
the values; we'll just leave them as they are and consider them "sorted"--which is wrong.

If we keep going, we will then check to see if the mininum value is equal to the maximum value,
and if so we'll rewrite the IN list as an equality predicate.  That said, the code assumes
that the values have been correctly sorted, in which case the minimum value should simply
be the first value in the list.  So Derby will at this point think that the minimum value
is "4.23" (which is wrong).

So in pseudo-code we'll have:

  if (minVal == maxVal)
    <transform to equality>

which becomes

  if (trunc(4.23) == 4)
    <transform to equality>

which becomes

  if (4 == 4)
    <transform to equality>

As a result, we transform the IN list into a simple equality of the form "i = <minValue>".
 But because of the incorrect sort we still think that our minimum value is "4.23", so the
query that we ultimately end up executing is:

  select * from t1 where i = 4.23

Thus the query returns no rows when it should have returned one.  Notice that if the IN list
values are reversed, i.e.:

  select * from t1 where i in (4, 4.23);

then all of the same (incorrect) logic will apply, except that the minimum value will be "4"
instead of "4.23" (because "4" appears first in the list).  Thus the query gets rewritten
to:

  select * from t1 where i = 4

which returns the expected row (somewhat by accident).  So this explains the behavior posted
by Yip in an earlier comment.

Problem II) Execution-time comparisons.

Even if the incorrect logic described above is fixed, we will still have cases where certain
queries return the wrong results.  The reason is that the execution-time logic for IN lists
has the same problem as the compilation-time logic: namely, Derby does all of the IN-list
comparisons using the left operand's type, which is wrong.  The relevant code can be found
in the DataType.in() method, where we start with the following comment:

    /* Do a binary search if the list is ordered until the 
     * range of values to search is 3 or less.
     *
     * NOTE: We've ensured that the IN list and the left all have
     * the same precedence at compile time.  If we don't enforce 
     * the same precendence then we could get the wrong result
     * when doing a binary search.
     */

Ironically enough, this comment makes reference to the problem at hand here: if we don't enforce
the correct precedence when comparing values, we can get wrong results.  The problem is that
the "NOTE" in this comment is WRONG--despite what it says, we actually did *NOT* ensure that
the IN list and the left operand had the same precedence at compile time.  Even if we assume
that the preprocessing logic is correct, it doesn't actually *change* the types of any of
the IN-list values--so when we get to execution, the values *can* in fact have different type
precedences.  Because of this broken assumption, the rest of the code in DataType.in() fails
in certain cases.

As an example we can use the same T1 mentioned above, but this time let our query be the following:

  select * from t1 where i in (2.8, 4.23)

Notice that the IN-list values are already in sorted order and that, when truncated, "2.8"
does NOT equal "4.23"--so the incorrect preprocessing described above will (accidentally)
do the correct thing.  But now when we get to execution, we're going to search the IN list
values for each of the values in T1(i).  Since the left operand is INTEGER, all comparisons
will be integer-based comparisons, meaning that they will use truncated versions of the decimal
values.  So we'll see something like:

  Search for "1":

    (1 == trunc(2.8))  --> (1 == 2) --> false (no match)
    (1 == trunc(4.23)) --> (1 == 4) --> false (no match)

  Search for "2":

    (2 == trunc(2.8))  --> (2 == 2) --> true (MATCH -- which is WRONG)

  Search for "3":

    (3 == trunc(2.8))  --> (3 == 2) --> false (no match)
    (3 == trunc(4.23)) --> (3 == 4) --> false (no match)

  Search for "4":

    (4 == trunc(2.8))  --> (4 == 2) --> false (no match)
    (4 == trunc(4.23)) --> (4 == 4) --> true (MATCH -- which is WRONG)

  Search for "5":

    (5 == trunc(2.8))  --> (5 == 2) --> false (no match)
    (5 == trunc(4.23)) --> (5 == 4) --> false (no match)

Thus the query will return two rows when in fact it should return NONE.

One final note here: when this issue was first created, the above query actually only returned
1 row, not two.  That was because Derby used to create an additional BETWEEN predicate for
the IN-list, and that predicate eliminated the row for "i == 2" (because 2 is not between
2.8 and 4.23).  But the changes for DERBY-47 replaced the BETWEEN optimization with a multi-probe
approach, thus further exposing (but not causing) the execution-time bug described here. 
This is further backed up by the fact that even before DERBY-47 the following query returned
two rows, as well:

ij> select * from t1 where i in (cast (2.8 as decimal(4, 2)), 4.23);
I
-----------
2
4

2 rows selected  

The reason is that the explict CAST disabled the BETWEEN optimization, thereby leading to
the same situation as just outlined.

> Wrong Results: Use of decimal values in an IN-list with INTEGER left operand can lead
to extra rows.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-2256
>                 URL: https://issues.apache.org/jira/browse/DERBY-2256
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.0.2.0, 10.0.2.1, 10.0.2.2, 10.1.1.0, 10.1.2.1, 10.1.3.1, 10.1.3.2,
10.1.4.0, 10.2.1.6, 10.2.2.0, 10.2.2.1, 10.2.3.0, 10.3.0.0
>            Reporter: A B
>         Assigned To: A B
>
> While trying out some code changes for DERBY-47 I was running a few test cases and happened
to notice that there are a couple of cases in which Derby behaves incorrectly (that or my
understanding of what should be happening here is way off).
> First and most simply: the following query should return zero rows (unless I'm missing
something?), but it returns one:
> ij> create table t1 (i int);
> 0 rows inserted/updated/deleted
> ij> insert into t1 values 1, 2, 3, 4, 5;
> 5 rows inserted/updated/deleted
> -- Correct returns zero rows.
> ij> select * from t1 where i in (4.23);
> I
> -----------
> 0 rows selected
> -- But this one returns 1 row...
> ij> select * from t1 where i in (2.8, 4.23);
> I
> -----------
> 4
> 1 row selected
> Secondly, if the IN-list contains a non-constant value then Derby can incorrectly return
rows that do not match the IN predicate.  I think this is because some internal casting is
happening when it shouldn't?
> ij> create table t1 (i int);
> 0 rows inserted/updated/deleted
> ij> insert into t1 values 1, 2, 3, 4, 5;
> 5 rows inserted/updated/deleted
> -- Following values clause returns "2.80", as expected.
> ij> values cast (2.8 as decimal(4, 2));
> 1
> -------
> 2.80
> 1 row selected
> -- But if we use it in an IN-list it gets cast to "2" and thus returns a match.
> -- We get 2 rows when we should get NONE.
> ij> select * from t1 where i in (cast (2.8 as decimal(4, 2)), 4.23);
> I
> -----------
> 2
> 4
> 2 rows selected
> I confirmed that we see these results on trunk, 10.2, 10.1, and even as far back as svn
#201660 for 10.0.  I also ran the above statements on DB2 v8 as a sanity check to confirm
that NO results were returned there.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message