db-derby-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Db-derby Wiki] Update of "OLAPNullOrdering" by BryanPendleton
Date Tue, 24 Jul 2007 20:34:12 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Db-derby Wiki" for change notification.

The following page has been changed by BryanPendleton:
http://wiki.apache.org/db-derby/OLAPNullOrdering

The comment on the change is:
Incorporated comments from DERBY-2887

------------------------------------------------------------------------------
- As part of the ["OLAPOperations"] effort, we'd like to add support for null ordering.
+ As part of the ["OLAPOperations"] effort, we'd like to add support for null ordering. Jira
issue [https://issues.apache.org/jira/browse/DERBY-2887 DERBY-2887] tracks this effort.
  
  == Rationale ==
  
- In earlier versions of the SQL standard, the ordering of NULL values was implementation-defined.
In Derby, NULL values sort high, and always appear after all non-NULL values in ascending
order and before all non-NULL values in descending order.
+ In earlier versions of the SQL standard, the ordering of NULL values was implementation-defined.

  
  The SQL 2003 standard allows the user to explicitly specify how to sort NULL values, by
adding the new NULLS FIRST or NULLS LAST specification in the ORDER BY clause.
+ 
+ In Derby versions 10.0 through 10.3, NULL values always sort high, and thus are always ordered
after all non-NULL values in ascending order and before all non-NULL values in descending
order. In the language of the standard, NULLS LAST is the default for ascending order, and
NULLS FIRST is the default for descending order. 
  
  == Syntax ==
  
@@ -32, +34 @@

  
  The new element here is the <null ordering> specification, which is optional.
  
- If the <null ordering> is not specified, then an implementation-defined <null ordering>
is used. For Derby, this will continue to be NULLS LAST.
+ If the <null ordering> is not specified, then an implementation-defined <null ordering>
is used. For Derby, this will continue to be NULLS LAST if the sort is ASCending, and NULLS
FIRST if the sort is DESCending.
  
  The tokens {{{FIRST}}} and {{{LAST}}} are already reserved keywords in the Derby SQL grammar;
the new token {{{NULLS}}} needs to be added as a new non-reserved keyword.
+ 
+ == FIRST and LAST, not LOW and HIGH ==
+ 
+ Note that the standard carefully uses the terms FIRST and LAST, rather than LOW and HIGH.
This is intentional, and highlights that the null ordering interacts with the ascending/descending
nature of the sort. It may make it more clear to look at this truth table:
+ 
+ ||ORDER BY says||ASC||DESC||
+ ||NULLS FIRST||less||greater||
+ ||NULLS LAST||greater||less||
+ 
+ where "less" means "NULL values shall be considered to be '''less''' than non-NULL values",
and "greater" means "NULL values shall be considered to be '''greater''' than non-NULL values.
+ 
+ === Interaction of NULLS FIRST/LAST and ASC/DESC ===
+ 
+ The results below indicate what I believe to be the intended behavior of NULLS FIRST when
combined with ASC/DESC. Note that NULLS FIRST means "the NULL values should be the first ones
in the result", regardless of whether the result is ordered ASC or DESC, and NULLS LAST behaves
correspondingly.
+ 
+ {{{
+ ij> select * from t1;
+ C1         |C2
+ -----------------------
+ NULL       |NULL
+ 1          |1
+ NULL       |NULL
+ 2          |1
+ 3          |1
+ 10         |10
+ 
+ 6 rows selected
+ ij> select * from t1 order by c1; -- since it was not specified, the implementation chooses
NULLS LAST here
+ C1         |C2
+ -----------------------
+ 1          |1
+ 2          |1
+ 3          |1
+ 10         |10
+ NULL       |NULL
+ NULL       |NULL
+ 
+ 6 rows selected
+ ij> select * from t1 order by c1 NULLS LAST;
+ C1         |C2
+ -----------------------
+ 1          |1
+ 2          |1
+ 3          |1
+ 10         |10
+ NULL       |NULL
+ NULL       |NULL
+ 
+ 6 rows selected
+ ij> select * from t1 order by c1 nulls first;
+ C1         |C2
+ -----------------------
+ NULL       |NULL
+ NULL       |NULL
+ 1          |1
+ 2          |1
+ 3          |1
+ 10         |10
+ 
+ 6 rows selected
+ ij> select * from t1 order by c1 desc; -- since it was not specified, the implementation
chooses NULLS FIRST here
+ C1         |C2
+ -----------------------
+ NULL       |NULL
+ NULL       |NULL
+ 10         |10
+ 3          |1
+ 2          |1
+ 1          |1
+ 
+ 6 rows selected
+ ij> select * from t1 order by c1 desc nulls last;
+ C1         |C2
+ -----------------------
+ 10         |10
+ 3          |1
+ 2          |1
+ 1          |1
+ NULL       |NULL
+ NULL       |NULL
+ 
+ 6 rows selected
+ ij> select * from t1 order by c1 desc nulls first;
+ C1         |C2
+ -----------------------
+ NULL       |NULL
+ NULL       |NULL
+ 10         |10
+ 3          |1
+ 2          |1
+ 1          |1
+ 
+ 6 rows selected
+ }}}
  
  == Current Implementation ==
  
@@ -63, +159 @@

  
  (As an aside, I wonder if there once was a time when NULL values sorted low in Derby, rather
than sorting HIGH. The reason I think this might have been true is that the JavaDoc for {{{DataValueDescriptor.compare()}}}
is incorrect, and says that null will be treated as '''less''' than all other values, and
also the code comment in NumberDataType.java is backward. Thirdly, the code in SQLBoolean.java
and in XML.java seems backward, which might be an un-noticed bug because the boolean data
type is not surfaced to users in current Derby, and the XML type is new and not much used
yet?)
  
- == Interaction of NULLS FIRST/LAST and ASC/DESC ==
- 
- Are the results below correct? In particular, how does NULLS FIRST interact with DESC? Is
it correct that if you say "DESC NULLS FIRST", then the NULL values actually appear last (they're
sorted FIRST, but the results are DESCending)?
- 
- {{{
- ij> select * from t1;
- C1         |C2
- -----------------------
- NULL       |NULL
- 1          |1
- NULL       |NULL
- 2          |1
- 3          |1
- 10         |10
- 
- 6 rows selected
- ij> select * from t1 order by c1;
- C1         |C2
- -----------------------
- 1          |1
- 2          |1
- 3          |1
- 10         |10
- NULL       |NULL
- NULL       |NULL
- 
- 6 rows selected
- ij> select * from t1 order by c1 nulls first;
- C1         |C2
- -----------------------
- NULL       |NULL
- NULL       |NULL
- 1          |1
- 2          |1
- 3          |1
- 10         |10
- 
- 6 rows selected
- ij> select * from t1 order by c1 desc;
- C1         |C2
- -----------------------
- NULL       |NULL
- NULL       |NULL
- 10         |10
- 3          |1
- 2          |1
- 1          |1
- 
- 6 rows selected
- ij> select * from t1 order by c1 desc nulls first;
- C1         |C2
- -----------------------
- 10         |10
- 3          |1
- 2          |1
- 1          |1
- NULL       |NULL
- NULL       |NULL
- 
- 6 rows selected
- ij> select * from t1 order by c1 desc nulls last;
- C1         |C2
- -----------------------
- NULL       |NULL
- NULL       |NULL
- 10         |10
- 3          |1
- 2          |1
- 1          |1
- 
- 6 rows selected
- }}}
- 
  == Proposed Changes ==
  
- There are several parts to implementing the <null ordering> feature for Derby:
+ There are several parts to implementing the <null ordering> feature for Derby: 
  
   * The new syntax must be added to the SQL grammar
   * The null ordering specification must be passed around through the compiler data structures
so that it eventually gets passed to the sorter. This involves data structures like {{{OrderByList}}}
and {{{OrderByColumn}}}. Eventually, the compiler must set up the correct {{{ColumnOrdering}}}
objects, which I think happens in {{{RowOrderingImpl.addOrderedColumn}}}
+  * Note that in the compiler, where we are working with the user's SQL statement, we should
use the terms "first" and "last", but once the compiler has finished compiling the statement,
and is constructing the data structures for execution, we will switch to using the terms "low"
and "high".
-  * The data structure which bridges the gap between the compiler and the execution code
is the {{{IndexColumnOrder}}} class. This class is used both for describing ORDER BY sorts
for queries, as well as for describing the sorts used when creating indexes. Since indexes
in Derby will remain unconditionally NULLS LAST, the index-creation callers of the {{{IndexColumnOrder}}}
class will specify NULLS LAST unconditionally, while the query compilation callers will specify
either NULLS FIRST or NULLS LAST depending on what was specified in the query.
+  * The data structure which bridges the gap between the compiler and the execution code
is the {{{IndexColumnOrder}}} class. This class is used both for describing ORDER BY sorts
for queries, as well as for describing the sorts used when creating indexes. Since indexes
in Derby will always sort nulls high, the index-creation callers of the {{{IndexColumnOrder}}}
class can continue to do this unconditionally, while the query compilation callers must specify
either low or high depending on what was specified in the query.
-  * The sorter must be changed so that in addition to having a columnOrderingMap and a columnOrderingAscendingMap,
it also has a columnOrderingNullOrderingMap, where it records the NULLS FIRST or NULLS LAST
specification for each column in the sort.
+  * The sorter must be changed so that in addition to having a columnOrderingMap and a columnOrderingAscendingMap,
it also has a columnOrderingNullOrderingMap, where it records whether nulls should be sorted
low or high, for each column in the sort.
   * The sorter must pass the user's choice to the DataValueDescriptor.compare() method call.
-  * The DataValueDescriptor.compare(DataValueDescriptor other) method must be changed to
allow the null ordering choice to be passed in
-  * All of the types which implement the compare method must be changed to use the specified
null ordering.
+  * The DataValueDescriptor.compare(DataValueDescriptor other) method must be changed to
allow the null ordering choice to be passed in. We want to be careful here, because the passing
of an extra argument to the compare() method will affect performance. For example, within
an index btree lookup the compare method will be called a lot, and we don't want to be needlessly
passing in an argument which will always be false.
+  * Also, wherever possible, we want to consolidate common data type code into the common
superclass, rather than repeating redundant code throughout the subclasses. One way or another,
all of the types which implement the compare method must be changed to use the specified null
ordering (possibly by calling a new common method which implements the null ordering decision,
then delegates through to the data-type-specific subclass for comparison of non-null values).
+  * The code which decides whether a query can use an index scan instead of a sort needs
to consider whether the null ordering allows that. This means that in some cases, depending
on how the user has specified the null ordering, we cannot use an index scan and must use
a sort instead.
+  * There seems to be some interaction between all the above code and the implicit "IS NULL"
predicate that is part of an OUTER join, which is causing me to get different "ordered null
semantics" on some of the OUTER JOIN queries in the Wisconsin regression suite. This behavior
needs to be investigated and resolved, since I don't think the null ordering changes should
affect the ordered null semantics of an OUTER JOIN.
   * A variety of new test cases must be added to the test suites, including:
     * There should be tests of ORDER BY clauses with NULLS FIRST specified, insuring that
null values indeed come first.
     * There should be tests of ORDER BY clauses with NULLS LAST specified, insuring that
null values indeed come last.
@@ -157, +183 @@

        * date/time data types
        * XML data types?
     * There should be tests with queries involving UNION / EXCEPT / INTERSECT operators,
ensuring that the NULLS FIRST / NULLS LAST setting is correctly "pushed down" through the
set operation.
+    * There should be tests that verify that sort avoidance occurs properly:
+      * if an ORDER BY clause can be validly satisfied by an index, it should continue to
do so
+      * if an ORDER BY clause can '''not''' be satisfied by an index, due to the user's <null
ordering> specification, the optimizer should force a true sort to occur.
+    * There should be tests that verify that the OUTER JOIN "ordered null semantics" are
preserved (FIXME: need to understand this better).
   * The existing test cases should all be run to verify that they still pass.
  

Mime
View raw message