db-derby-dev mailing list archives

From "Dag H. Wanvik (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DERBY-532) Support deferrable constraints
Date Wed, 06 Nov 2013 16:04:18 GMT

    [ https://issues.apache.org/jira/browse/DERBY-532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13814871#comment-13814871 ]

Dag H. Wanvik edited comment on DERBY-532 at 11/6/13 4:02 PM:
--------------------------------------------------------------

Thanks a lot, Mike! I have answered inline below. I might not have understood all your points;
please bear with me, and don't hesitate to ask again or elaborate on unclear issues.

> I would like to understand what would be the problem with
> implementing deferred constraints using existing "duplicate" BTree
> with no changes, and then do all deferred constraint checking in the
> sql layer doing probes to ascertain constraint consistency using
> existing access interfaces (with maybe something more specialized if
> that allows for faster path).

Not sure if I understand you correctly there, but the approach taken is essentially to use
the existing "duplicate" BTree approach implemented for nullable unique constraints/indexes,
but with a twist:

* when we see a duplicate on insert (this is already being checked, nothing new there), remember
it

* for primary key and unique not null constraints which are deferrable (and only those: non-deferrable
constraints/indexes are unchanged except for a couple of extra tests, see below), we now use
the "duplicate" BTree approach also

Somehow, we do need to intercept the current duplicate checks when BTree inserts happen, so that
they don't throw and the data does get inserted. Otherwise we'd need to store the duplicate rows
somewhere else, which would complicate operations later in the transaction: a query, for example,
would need to look in two places for its data. I didn't want to go there...
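
For concreteness, here is a minimal sketch of that interception, assuming the insert path gets a
callback when a duplicate key is detected; the class and method names below are hypothetical, not
the actual Derby code:

{noformat}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the insert-time interception described above;
// not the actual Derby code.
class SketchDuplicateInterception {

    static class DuplicateKeyException extends Exception { }

    private final boolean deferrable;                 // is the constraint deferrable?
    private final List<Object[]> rememberedKeys = new ArrayList<>();

    SketchDuplicateInterception(boolean deferrable) {
        this.deferrable = deferrable;
    }

    /** Called when the BTree insert finds that the key already exists. */
    void onDuplicateKey(Object[] key) throws DuplicateKeyException {
        if (!deferrable) {
            // existing behavior: fail the statement immediately
            throw new DuplicateKeyException();
        }
        // deferred behavior: the row goes in anyway; we just remember the key
        // so the constraint can be re-checked at commit time
        rememberedKeys.add(key.clone());
    }
}
{noformat}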

> I know you are following a path that the "unique" with duplicate code
> followed. I have regretted that approach since it was implemented. The
> project resulted in many bugs because of the complication involved in
> twisting the code to do something that it really did not want to do.

Yes I am aware of that; I did find one extra bug with this approach, which I have fixed as
part of this work: DERBY-6374. If/when we redesign this approach, I hope we could carry the
implementation for deferred constraints along somehow.

> if the new BTree approach is checked in, should add tests to look out
> for bugs like DERBY-3502 which came about the last time a new, but
> similar BTree was checked in.

> It looks like a lot of the fast path code paths now have added code to
> deal with deferred update. Should code for instance in
> InsertResultSet.java be broken out and there to be one routine to
> handle deferred constraint and one to not. With some sort of
> inheritance to handle it?

As far as the existing fast code path (for non-deferrable constraints and for indexes that are
not "nullable unique") is concerned, the new code adds the following extra overhead:

{noformat}
* IndexChanger#insertAndCheckDups: one extra simple boolean test:

      "!deferrable"

* BTreeController#doIns: the two existing tests for

      btree.isUniqueWithDuplicateNulls()

  now each have an extra clause:

      btree.isUniqueWithDuplicateNulls() ||
      btree.hasDeferrableChecking()

  This could be optimized to use the booleans directly:

      btree.uniqueWithDuplicateNulls ||
      btree.uniqueDeferrable
{noformat}

So, in total, two extra simple boolean checks per insert, which isn't bad I think. I am sure
we could factor this out in a more object-oriented fashion, but I'd worry that would entail
much redundancy and/or extra method calls on the current fast path; see the sketch below.
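
To make that trade-off concrete, here is a rough, hypothetical sketch (none of these class names
exist in the patch, which keeps the in-line boolean tests): the flat test on the fast path versus
a subclass-per-behavior split, which costs a virtual dispatch on every insert plus some duplicated
surrounding code:

{noformat}
// Hypothetical sketch only; none of these names are in the patch.
class SketchFlatPath {
    boolean uniqueWithDuplicateNulls;
    boolean uniqueDeferrable;

    // the approach in the patch: two flat boolean tests on the fast path
    void doIns(Object[] row) {
        boolean allowDuplicates = uniqueWithDuplicateNulls || uniqueDeferrable;
        insert(row, allowDuplicates);
    }

    void insert(Object[] row, boolean allowDuplicates) {
        // stand-in for the real BTree insert
    }
}

// the object-oriented alternative: one subclass per behavior, at the cost
// of a virtual call per insert and some duplicated surrounding code
abstract class SketchSubclassedPath {
    abstract void doIns(Object[] row);
}

class SketchStrictPath extends SketchSubclassedPath {
    void doIns(Object[] row) { /* a duplicate key aborts the statement */ }
}

class SketchDeferrablePath extends SketchSubclassedPath {
    void doIns(Object[] row) { /* duplicates are accepted and remembered */ }
}
{noformat}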

> should there be changes to update code in addition to insert code, or
> is this one of the benefits of doing this work at lowest level where
> all updates on BTree's are turned into deletes and inserts?

Yes, we get that for free, since at this level B-tree updates are turned into deletes and inserts.

> deferred to end transaction seems like definitely the hardest part of
> deferred constraints. As an incremental strategy did you consider
> doing just the "alter table" level part first. I wonder how many
> customers would be helped by just being able to turn off constraint
> checking initially and enabling later and are ok with taking the hit
> of dropping and recreating indexes for this at "check" time.

Yes, I did consider it, but we have use cases that can benefit from the standard SQL feature.
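
For reference, this is the kind of usage the standard SQL feature is meant to enable (a sketch
only; the table and constraint names are illustrative, and the DDL shown is the standard SQL form
of the feature):

{noformat}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of a deferrable primary key whose checking is postponed to commit.
public class DeferredConstraintExample {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                "jdbc:derby:memory:demo;create=true")) {
            c.setAutoCommit(false);
            try (Statement s = c.createStatement()) {
                s.execute("CREATE TABLE t (id INT NOT NULL, val VARCHAR(10), "
                        + "CONSTRAINT t_pk PRIMARY KEY (id) "
                        + "DEFERRABLE INITIALLY DEFERRED)");
                // Both rows go in even though they collide on the key ...
                s.execute("INSERT INTO t VALUES (1, 'a'), (1, 'b')");
                // ... as long as the duplicate is resolved before commit.
                s.execute("UPDATE t SET id = 2 WHERE val = 'b'");
            }
            c.commit(); // the deferred primary key check runs here
        }
    }
}
{noformat}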

> I am trying to understand the locking and isolation implications of
> doing the "duplicate" checking in the non-duplicate new btrees at
> insert time.

Again, I'm not sure I understand 100% what you're asking, but: I tried to make insertion of
duplicate rows work with a non-duplicate B-tree, and had to abandon that approach. Cf. above:
for deferred constraints we always use the approach of the duplicate "nullable unique" B-trees.
Using that approach for deferrable PRIMARY KEY and UNIQUE "not null" constraints will, I should
think, show locking behavior similar to the current "nullable unique" implementation.

Duplicate rows would keep their row X locks till transaction end, since as far as the index
is concerned they are all unique (the key includes the row location of the base row).
As far as isolation goes, I think only other transactions that use "read uncommitted" could ever
see the duplicates.
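
To spell out why the index treats them as distinct (a schematic only, not the actual Derby data
structures): the full key is the constraint columns plus the base row's location, so two logical
duplicates never compare as equal in the BTree:

{noformat}
// Schematic only, not actual Derby data structures.
public class SketchIndexKeyDemo {
    public static void main(String[] args) {
        // full index key = (constraint column value, row location of the base row)
        long[] firstEntry  = { 42, 1001 };  // first insert of key 42
        long[] secondEntry = { 42, 1002 };  // deferred duplicate of key 42

        // The keys differ in the row-location part, so the BTree treats them
        // as distinct entries and each row keeps its own X lock until commit.
        System.out.println(java.util.Arrays.equals(firstEntry, secondEntry)); // false
    }
}
{noformat}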

> And what is the right thing to do with rows marked deleted.

Yes, that's what I hit problems with.

> I think doing this check at insert time is going to add unintended
> problems with either isolation or locking.

Given the above, I'm not sure I see how. Could you explain what the concern would be here?
Again, if the nullable unique indexes are OK as far as isolation and locking go, I think the
deferred variants should be OK too?

> I think the insert time check for duplicates is really a
> performance thing to narrow down the number of deferred rows to keep
> track of,

Well, today, the inserts will fail due to the duplicate checks in the B-tree, so we'd need
to do something to intercept this in any case. Can you sketch in some more detail what you
have in mind?

> perform which then could be optimized later with a more complicated
> btree implementation later if necessary. Would this type of
> implementation also match up well with whatever we would have to do
> for deferred foreign key constraints?
>
> We have always made the indexes created for constraint checking
> available to the optimizer to pull rows out of in the past. Should
> these new deferred constraint indexes also be available, or is there
> anything special we can do with these to make deferred update
> implementation easier. I looked at some public documents and did not
> get any specific info on how other db's do this, but got the feeling
> that these might be treated differently in some way vs. other
> indexes. We should be careful on implementation to make sure stuff
> like sort avoidance for uniqueness works correctly with these.

I made sure that the deferred indexes would all answer "false" to the isUnique() method, which
the optimizer consults, so I think this would bar the optimizer from presuming (wrongly) that
there is only one row. Again, this might make some queries on deferrable indexes less performant,
but that's the price one has to pay to have deferred constraints. The existing non-deferrable
unique indexes/constraints will not be impacted.
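
A minimal sketch of that idea (the field names below are hypothetical; only isUnique() and the
notion of deferrable checking come from the discussion above):

{noformat}
// Hypothetical sketch: a deferrable unique index reports itself as
// non-unique so the optimizer never assumes at most one matching row.
class SketchIndexDescriptor {
    private final boolean declaredUnique;
    private final boolean hasDeferrableChecking;

    SketchIndexDescriptor(boolean declaredUnique, boolean hasDeferrableChecking) {
        this.declaredUnique = declaredUnique;
        this.hasDeferrableChecking = hasDeferrableChecking;
    }

    boolean isUnique() {
        // A deferrable constraint may hold duplicates until commit, so we
        // must not promise uniqueness to the optimizer.
        return declaredUnique && !hasDeferrableChecking;
    }
}
{noformat}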

> If the deferred unique constraint indexes were not made available to
> the optimizer to satisfy query results from (and thus the rows could
> be out of date until end of transaction), then another option would be
> to define the unique constraints exactly as they are today in the
> store - but somehow mark them as not available to optimizer.

Not sure I understand: do you mean that queries couldn't use the index at all until transaction
end? I think that would perform worse than what I currently do, which is to offer a non-unique index.

> The obvious problem with them is that the inserting transaction
> should see rows it has inserted and won't. And just defer the
> updating of this index until end of transaction, driving the updates
> from a deferred list maintained by the sql layer.

In the current solution, the inserting transaction does see the duplicate rows.

> In this case all isolation level locking would be unchanged (just
> delayed) and no possible unexpected locking differences between a
> deferred and non-deferred constraint.

The deferred case would retain locks on more rows till transaction end, which is to be expected.
I think it is acceptable that deferrable constraints could show slightly different locking
behavior than non-deferrable constraints in any case. But again, I might not understand exactly
what you are driving at here...

> Could the SQL layer during deferred mode keep track of every insert
> into the "duplicate" constraint index and then do the constraint
> check at commit time itself.

Instead of just keeping track of the duplicates, you mean? Again, we'd need to make sure the
inserts would succeed somehow. We do perform the constraint check at commit time in the present
solution...

> At end transaction time I think it is clear that every row that is
> looked at to do the duplicate check needs to be first locked and
> waited on and then checked, with locks released according to
> isolation level standards.

Agreed. The present solution may not be correct here; I'll take another look. The B-tree scan
I use at commit time uses ISOLATION_READ_COMMITTED_NOHOLDLOCK to check for the duplicates:
the transaction already holds X locks on the rows it inserted, but it would need read locks
on any other row with the same key. But presumably, for repeatable read/serializable, we'd
already have those locks too?
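
In rough outline, the commit-time check for each remembered key amounts to the following; the
names are hypothetical stand-ins for the store's scan machinery, not the actual interfaces:

{noformat}
import java.util.List;

// Rough outline of the commit-time validation of a deferred unique
// constraint; KeyScanner stands in for the store scan, which in the patch
// runs with read-committed, no-hold locking.
class SketchDeferredValidation {

    interface KeyScanner {
        /** Rows currently in the index that match the key, excluding rows marked deleted. */
        int countMatchingRows(Object[] key);
    }

    static void validate(List<Object[]> rememberedKeys, KeyScanner scanner) {
        for (Object[] key : rememberedKeys) {
            // Each remembered key must now resolve to at most one surviving
            // row; otherwise the deferred constraint is violated at commit.
            if (scanner.countMatchingRows(key) > 1) {
                throw new IllegalStateException(
                        "deferred unique or primary key constraint violated at commit");
            }
        }
    }
}
{noformat}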

> This would include rows marked deleted, which would be locked to
> make sure they are not part of other transactions that have not
> committed yet. This check could be coded as a scan for the key

That's what I do, I think? Cf. DeferredDuplicates#validate.

> and not 2 rows or it would be pretty clean to add a special
> interface to return true/false for more than one row match for a
> given key description.



> Support deferrable constraints
> ------------------------------
>
>                 Key: DERBY-532
>                 URL: https://issues.apache.org/jira/browse/DERBY-532
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Jörg von Frantzius
>            Assignee: Dag H. Wanvik
>              Labels: derby_triage10_11
>         Attachments: deferredConstraints.html, deferredConstraints.html, deferredConstraints.html,
deferredConstraints.html, derby-532-import-1.diff, derby-532-import-1.status, derby-532-import-2.diff,
derby-532-import-3.diff, derby-532-import-3.status, derby-532-more-tests-1.diff, derby-532-more-tests-1.stat,
derby-532-syntax-binding-dict-1.diff, derby-532-syntax-binding-dict-1.status, derby-532-syntax-binding-dict-2.diff,
derby-532-syntax-binding-dict-2.status, derby-532-syntax-binding-dict-all-1.diff, derby-532-testAlterConstraintInvalidation.diff,
derby-532-testAlterConstraintInvalidation.status, derby-532-unique-pk-1.diff, derby-532-unique-pk-1.status,
derby-532-unique-pk-2.diff, derby-532-unique-pk-3.diff, derby-532-unique-pk-3.status, derby-532-xa-1.diff,
derby-532-xa-2.diff, derby-532-xa-3.diff, derby-532-xa-3.status
>
>
> In many situations it is desirable to have constraint checking take place only at
transaction commit time, and not before. If, e.g., there is a chain of foreign key constraints
between tables, insert statements have to be ordered to avoid constraint violations. If foreign
key references are circular, the user has to split the DML into insert statements and subsequent
update statements.
> In other words, with deferred constraint checking, life is much easier for the user. The
lack of it can also create problems with software such as object-relational mapping tools that
are not prepared for statement ordering and thus depend on deferred constraint checking.



