db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "A B (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DERBY-3926) Incorrect ORDER BY caused by index
Date Wed, 06 May 2009 16:29:31 GMT

     [ https://issues.apache.org/jira/browse/DERBY-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

A B updated DERBY-3926:
-----------------------

    Attachment: d3926_repro.sql

Attaching another SQL file, d3926_repro.sql, which reproduces the problem (for me) with a
simpler set of tables and data.

This repro was motivated by the observation made by Mike and re-iterated by Mamta, namely:

> the key here is that the outer table(table3) is returning more than one row and
> each one of those row is requiring us to look at the middle table (table2) which
> results into 3 scans on table2 

Put differently, the outer table is the one which is "driving" the iteration because a) each
row from the outer table leads to a scan on the middle table, and b) the optimizer determines
that no sort is necessary.  Thus the order of the result is based solely on the order of the
rows that are retrieved from the outer table.

Some observations about what was necessary to get this particular repro to work:

  1) The outer table (T1) has a non-unique index on column I1, and we force the optimizer
to use that non-unique index.

      Why? If the optimizer were to use a table scan for T1 then we would check to see if
T1 was a "oneRowResultSet", which it isn't (and can't be, since we need T1 to return multiple
rows in order to satisfy observation 3 below).  Since it's not a one-row result set T1 would
then get added to the list of "unordered optimizables" for the join order--and if that list
has at least one optimizable in it, we would end up doing an explicit sort and thus the problem
would not repro.  So the plan for T1 must use an index.

  2) There is a predicate in the WHERE clause which compares the non-unique indexed column
T1.I1 to a CONSTANT expression.

      Why? If such a predicate did not exist then T1.I1 would be added as the first column
in the "rowOrdering" for the join order.  Then when the optimizer adds the middle table (T2)
to the join order, it would see that the index for T2 does *not* satisfy the ordering requirement
of T1.I1, which means we would end up doing an explicit sort for the whole plan.  So the problem
would not repro.  By adding a predicate to compare T1.I1 with a constant, we effectively make
T1.I1 "always ordered" and so we do not need to add it to the row ordering.

  3) The outer table (T1) has multiple rows which have the same value for the indexed column
T1.I1.

      Why? The presence of multiple rows is important because that's what leads to multiple
scans on the middle table (as pointed out by Mike and Mamta).  So we need to have a predicate
which compares to a fixed constant value (observation 2), but we also need that predicate
to return multiple rows.  Thus there must be multiple rows in T1 which have I1 column values
that equal the constant value used in the predicate.  (This is why the index for T1 must be
non-unique.)

  4) The middle table (T2) has an index that is ordered the same way as the ORDER BY clause.
 We force the optimizer to use that index for T2.

      Why? If the optimizer is using an index that satisfies the ordering requirement for
the query, it will try to avoid sorting the resultant rows.  Sort avoidance is key to reproducing
the reported behavior--esp. the optimizer *thinks* it can avoid the sort, and does so, but
in truth it should *not* have done so.

  5) The index for T2 is not covering--and esp. it does *not* include the column T2.I2 that
is used for joining with the outer table.

      Why? The fact that the index is non-covering means that we will have to go to the T2
table conglomerate to fetch the row that has the current join value--and access to the table
conglomerate is _unordered_.

  6)  The column from T1 (outer table) that is joined with T2 (middle table) has varied values
for each of the rows.

      Why? The presence of different values in T1.J1 means that we will scan T2's table conglomerate
multiple times for different T2.I2 values, and since the table conglomerate is unordered (observation
5), those multiple scans will return the rows in an order that does *not* match the index
order.

  7) The rows that are inserted into T2 are inserted in an order that does NOT match the ORDER
BY ordering.

      Why? It appears that, when inserting rows into a table, the order of the rows in the
base table conglomerate generally matches the insertion order.  I don't think there are any
guarantees of that, but that's what I observed for this simple data.  So if we were to insert
the rows in proper order, access to the base table conglomerate (observations 5 & 6) might
in fact return the rows in the desired order by accident, which would hide the problem that
we're trying to reproduce.

With all of those observations in place, I was able to write the attached script to reproduce
the problem for me.  The results I see when I run are:

J1         |J2  |J3
---------------------------
0          |f   |five
1          |g   |six
2          |e   |four

but the query specifies "ORDER BY t2.j2", so the rows are in the wrong order.

It would be great if others could try to run the script to make sure they see the same behavior
(if not, then some or all of this comment may in fact be wrong or incomplete...).

> Incorrect ORDER BY caused by index
> ----------------------------------
>
>                 Key: DERBY-3926
>                 URL: https://issues.apache.org/jira/browse/DERBY-3926
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.1.3.3, 10.2.3.0, 10.3.3.1, 10.4.2.0
>            Reporter: Tars Joris
>            Assignee: Mamta A. Satoor
>         Attachments: d3926_repro.sql, derby-reproduce.zip, script3.sql, script3WithUserFriendlyIndexNames.sql,
test-script.zip
>
>
> I think I found a bug in Derby that is triggered by an index on a large column: VARCHAR(1024).
I know it  is generally not a good idea to have an index on such a large column.
> I have a table (table2) with a column "value", my query orders on this column but the
result is not sorted. It is sorted if I remove the index on that column.
> The output of the attached script is as follows (results should be ordered on the middle
column):
> ID                  |VALUE        |VALUE
> ----------------------------------------------
> 2147483653          |000002       |21857
> 2147483654          |000003       |21857
> 4294967297          |000001       |21857
> While I would expect:
> ID                  |VALUE        |VALUE
> ----------------------------------------------
> 4294967297          |000001       |21857
> 2147483653          |000002       |21857
> 2147483654          |000003       |21857
> This is the definition:
> CREATE TABLE table1 (id BIGINT NOT NULL, PRIMARY KEY(id));
> CREATE INDEX key1 ON table1(id);
> CREATE TABLE table2 (id BIGINT NOT NULL, name VARCHAR(40) NOT NULL, value VARCHAR(1024),
PRIMARY KEY(id, name));
> CREATE UNIQUE INDEX key2 ON table2(id, name);
> CREATE INDEX key3 ON table2(value);
> This is the query:
> SELECT table1.id, m0.value, m1.value
> FROM table1, table2 m0, table2 m1
> WHERE table1.id=m0.id
> AND m0.name='PageSequenceId'
> AND table1.id=m1.id
> AND m1.name='PostComponentId'
> AND m1.value='21857'
> ORDER BY m0.value;
> The bug can be reproduced by just executing the attached script with the ij-tool.
> Note that the result of the query becomes correct when enough data is changed. This prevented
me from creating a smaller example.
> See the attached file "derby-reproduce.zip" for sysinfo, derby.log and script.sql.
> Michael Segel pointed out:
> "It looks like its hitting the index ordering on id,name from table 2 and is ignoring
the order by clause."

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message