db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Matrigali (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DERBY-2212) Add "Unique where not null" to create index
Date Fri, 19 Oct 2007 07:58:50 GMT

     [ https://issues.apache.org/jira/browse/DERBY-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Mike Matrigali updated DERBY-2212:
----------------------------------


I don't think I have explained very clearly why I don't think the current
approach is best.  I especially have been using unique/non-unique incorrectly.
In the following I am only talking about the internal
store btree implementation, for this discussion ignore the SQL index built
on top of them.

The btree design is ARIES based.  The implementation actually only implements
a unique btree, but by picking key columns and options one can implement
both unique and non-unique indexes using the same code.  Only supporting unique
btree actually simplifies a lot of the binary search, leaf/branch splitting, 
and node merging code.  So code that uses the btree to implement something like
a unique constraint doesn't actually tell the code it is unique.  There are
a number of options you specify when you are creating the btree.  There are
2 that are of interest to this discussion:
number of columns in the key and leading number of columns in the key that 
uniquely identifies the row in the btree.  


As the most simple example take the implementation of a single column SQL
index.  In that case the btree is actually created with 2 columns, which
when combined are always unique (whether or not it is a unique or non-unique
SQL index).  In the current implementation of a unique single column index
the btree is created with 2 columns and it knows that a row can be uniquely
identified by the 1st column.  In the case of a non-unique single column
index the btree is created with with 2 columns and told that it takes both
columns to uniquely identify a row in the index.  The system insures that all
the rows in the index in this case are unique by insuring that the row 
location column is unique, derby goes out of it's way to never reuse a row
id and thus insure that every row inserted in a base table has a unique
row id.
The reason we use number of columns rather than a simple unique vs. non-unique
setting is that someday we may also add columns to the index that actually 
don't participate at all in the searching of the index and in that case there 
might be a 7 column btree where only 3 leading columns are necessary (this
is often requested for optimizing performance using covered queries for an 
index).
The setting of how many
columns it takes to uniquely identify a row is critical, and is used for a
number of different operations within the btree some of which have nothing to
do with returning result sets to SQL queries.  

At the btree level it is important that comparisons of each element remain
consistent.  So even if SQL thinks null's can't be compared, in order to be
able to find the rows at the btree level there must be a consistent comparison
between one null and another.  The current implementation is that a null
equals a null, and seems the easiest to implement at this level.  

All of the above have led to the current situation where in a unique index
one null is allowed but not 2.  

As soon as you want to implement a 2 column btree which allows the insertion
of: (null, row id 1), (null, row id 2), then it is clear that the minimun
number columns necessary to uniquely identify the row is 2 and not 1.  To me
that makes it a non-unique index, but I don't actually care what the language
system catalogs call it -- as long as btree is correctly created indicating
that it takes 2 columns to uniquely identify the row.

But of course it is not that simple.  The job of insert includes, but is
not limited to:
1) determine if the row is valid to insert
2) determine exactly the "right" place in the index to insert the row
3) lock the correct rows to implement whatever the isolation level of the xact

Current "unique index" implementation (not the whole story leaving out details concerning
finding and dealing with deleted rows):
A we search the index for an exact match on the number of columns necessary 
  to uniquely identify the row, so in the case of single column index we just
  search using 1 column.  If we get a match then there is a duplicate key
  error.
B The search in step A is set up such that if there is no exact match then the 
  same code has found the exact row in the tree where the row is to be 
  inserted just before.  This works because we "know" that the above search 
  that only used the 1st column must find the right place in the tree for column
  the value of the 2nd column does not matter.
C locks the row being inserted.
D Depending on isolation level it may also lock the previous row to insure
  no phantom inserts happen to a serializable reader.  Note the previous row
  may not be on the current leaf page.

Current "non-unique" index implementation is exactly the same except that the
checks are altered based on the number of columns it takes to unique identify
a row.
A we search the index for an exact match on the number of columns necessary 
  to uniquely identify the row, so in the case of single column index we 
  search using 2 column.  If we get a match then there is a duplicate key
  error, but only a bug will make this happen as the 2nd column should always
  be unique by itself - but if you went in somehow and forced the insert of
  "1, rowid 1", "1, rowid 1" into this btree a duplicate key error would be 
   thrown (or maybe an ASSERT - not sure).
B The search in step A is set up such that if there is no exact match then the 
  same code has found the exact row in the tree where the row is to be 
  inserted just before.  In this case the search has used all columns so if it
  doesn't find exact match it again has ended up exactly where the row should
  be inserted.
C locks the row being inserted.
D Depending on isolation level it may also lock the previous row to insure
  no phantom inserts happen to a serializable reader.  Note the previous row
  may not be on the current leaf page.
  

The search code interfaces were carefully crafted to do exactly what we wanted
in these 2 cases.  I think the new behavior does not lend itself to a single
search to perform both A and B.  Assume the null duplicate behavior we are 
looking for can be described as (I am waiting on an answer to whether this
assumption is valid): 
    if there are any null's in any column in the row don't do duplicate 
    checking, otherwise do duplicate checking on leading n-1 columns.
    
Here is what I think may work best for the new behavior, 
with the specific example of a 1 column index:

1) set the btree to 2 columns
2) set the number of columns to uniquely identify the row to be 2 also
3) set a new flag that indicates special duplicate checking processing needed
4) 
   if (magic duplicate checking flag)
   (
       null_present_in_key column = search the row for nulls;

       Perform the search exactly as we do above in the non-unique index 
       case as described in step A, you will have the right place to insert
       the row.  

       if (null_present_in_key_column)
       {
           you are done, insert the row no duplicate checking necessary
       }
       else
       {
           // the following may not be exactly right, but I think you can 
           // do duplicate checking somehow just based on the row to your left
           // and row to right.  The idea is that worst case depending on 
           // order of your row location you have either landed left or right
           // of the single key that matches the insert row.
           logically compare insert row to row to left and row to right using
           only 1 column, no magic null handling necessary.  If either compare 
           equal then throw duplicate key.
       }
   }
5) do same isolation locking as existing code for duplicate indexes.

   This might not be the most elegant, but I think it has the least chance of
   breaking a lot of the system at the cost of 1 null row check, 
   and 2 row compares.  
   The code is isolated to insert and no change is necessary to any compare
   interfaces.  It should perform better than forcing the language layer to 
   first do a probe and then do an insert for this kind of unique index.  
   



> Add "Unique where not null" to create index
> -------------------------------------------
>
>                 Key: DERBY-2212
>                 URL: https://issues.apache.org/jira/browse/DERBY-2212
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.2.1.6
>            Reporter: Oleksandr Alesinskyy
>            Assignee: Anurag Shekhar
>         Attachments: derby-2212preview.diff, derby-2212preview2.diff
>
>
> Derby prohibits creation of unique constraints on nullable colums (as well if only some
columns in the constraint list are nullable) and treat nulls in unique indexes as normal values
(i.e. only one row with null values in indexed columns may be inserted into the table). This
bahavior is very restrictive, does not completely comply with SQL standards (both letter and
intent) as well as with business needs and intending meaning of NULL values (2 null values
are not considered as equal, this comparision shall return NULL, and for selection criteria
boolean null is treated as FALSE).
> This behavior, as far as I can see, is modelled after DB2 (and differs from behavior
of most other major databases, like SyBase, Oracle, etc.).
> But even DB2 provide some means to alleviate these restrictions, namely "UNIQUE WHERE
NOT NULL" clause for CREATE INDEX statement.
> It will be very good if such "UNIQUE WHERE NOT NULL" clause will be introduced in Derby.
> Regards,
> Oleksandr Alesinskyy

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message