db-derby-dev mailing list archives

From "Dag H. Wanvik (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (DERBY-4741) Make Derby work reliably in the presence of thread interrupts
Date Mon, 29 Nov 2010 00:51:40 GMT

    [ https://issues.apache.org/jira/browse/DERBY-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964612#action_12964612 ]

Dag H. Wanvik edited comment on DERBY-4741 at 11/28/10 7:50 PM:
----------------------------------------------------------------

This patch (derby-4741-c-01-nio) closes two corner cases I have
observed when stress testing the RAFContainer4 recovery mechanism. It
also makes some other small cleanups. Regressions ran OK.

RAFContainer:

If we receive an interrupt when the container is first being opened
(i.e. during RAFContainer.run (OPEN_CONTAINER_ACTION) ->
getEmbryonicPage), recovery will fail because currentIdentity, which
RAFContainer4#recoverContainerAfterInterrupt needs, has not been set
yet.
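
The ordering problem can be illustrated with a small, self-contained
Java sketch. It is not Derby code. OpenOrderSketch, readEmbryonicPage
and recoverAfterInterrupt are hypothetical stand-ins, and the interrupt
is simulated with a flag rather than a real ClosedByInterruptException.
The sketch assumes one possible remedy, which is to record the identity
before the first page read. The actual patch may handle the case
differently.

```java
import java.util.Objects;

// Hypothetical sketch, not Derby code: recovery after an interrupt needs to
// know which container to reopen, so the identity must be recorded before
// the first page read performed while opening.
class OpenOrderSketch {
    private String currentIdentity;    // stand-in for RAFContainer's currentIdentity
    private boolean interruptPending;  // simulates one interrupt during the first read

    OpenOrderSketch(boolean interruptDuringOpen) {
        this.interruptPending = interruptDuringOpen;
    }

    // Simulated embryonic page read: a pending interrupt triggers recovery once.
    private void readEmbryonicPage() {
        if (interruptPending) {
            interruptPending = false;
            recoverAfterInterrupt();
        }
        // ... read and validate the first page ...
    }

    // Simulated recovery: reopening the file is impossible without an identity.
    private void recoverAfterInterrupt() {
        Objects.requireNonNull(currentIdentity, "currentIdentity not set; cannot recover");
        // ... reopen the channel for currentIdentity and retry the read ...
    }

    // Problematic order: the identity is assigned only after the first read.
    void openIdentityLast(String identity) {
        readEmbryonicPage();
        currentIdentity = identity;
    }

    // Safer order: the identity is assigned before any I/O that may need recovery.
    void openIdentityFirst(String identity) {
        currentIdentity = identity;
        readEmbryonicPage();
    }

    static boolean identityLastFailsRecovery() {
        try {
            new OpenOrderSketch(true).openIdentityLast("container-1");
            return false;
        } catch (NullPointerException e) {
            return true;  // recovery could not find the container to reopen
        }
    }

    static boolean identityFirstRecovers() {
        new OpenOrderSketch(true).openIdentityFirst("container-1");
        return true;
    }
}
```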

RAFContainer4:

If a stealthMode read is interrupted and recovers the container, it
erroneously increments threadsInPageIO just before exiting to retry
the I/O. This breaks the invariant that threadsInPageIO is 0 when all
threads are done, causing a hang later on.
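
The following is a minimal model of the counter invariant. It assumes
that each read attempt increments threadsInPageIO on entry and
decrements it in a finally block. Only the counter name comes from this
comment. PageIoCounterSketch, RetryIO and the retry loop are
hypothetical, and stealth mode itself is not modelled. With the extra
increment, the count stays at 1 after a single interrupted-then-retried
read, so anything waiting for it to reach 0 would hang.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the threadsInPageIO invariant: when no thread is
// doing page I/O, the counter must be back at 0.
class PageIoCounterSketch {
    static final AtomicInteger threadsInPageIO = new AtomicInteger();

    // Signals that the container was recovered and the I/O must be retried.
    static class RetryIO extends Exception {}

    // One read attempt; 'interrupted' simulates an interrupt hitting this attempt.
    static void readAttempt(boolean interrupted, boolean extraIncrement) throws RetryIO {
        threadsInPageIO.incrementAndGet();
        try {
            if (interrupted) {
                // ... recover the container ...
                if (extraIncrement) {
                    // The erroneous step: the finally block below only undoes
                    // the increment made on entry, so this one leaks.
                    threadsInPageIO.incrementAndGet();
                }
                throw new RetryIO();
            }
            // ... perform the actual page read ...
        } finally {
            threadsInPageIO.decrementAndGet();
        }
    }

    // Reads a page whose first attempt is interrupted and whose retry succeeds,
    // then reports the value the counter was left at.
    static int counterAfterRetriedRead(boolean extraIncrement) {
        threadsInPageIO.set(0);
        boolean interruptThisAttempt = true;
        while (true) {
            try {
                readAttempt(interruptThisAttempt, extraIncrement);
                return threadsInPageIO.get();
            } catch (RetryIO e) {
                interruptThisAttempt = false;  // the retry is not interrupted
            }
        }
    }
}
```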

If the read done while we are reopening the container
(getEmbryonicPage) is itself interrupted, that stealth mode read will
also lead to a (recursive) recovery. We have to catch this case by
adding a "catch (InterruptDetectedException e)" just after the call to
openContainer, not by testing the interrupt flag as is presently done,
since the recovery inside the recursive call to
getEmbryonicPage/readPage will already have cleared the flag and done
the recovery.
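
The sketch below shows why testing the interrupt flag after
openContainer misses the nested recovery. The inner recovery has
already cleared the flag, so only the exception tells the caller that
the open must be retried. All names are hypothetical. The
InterruptDetectedException here is a local stand-in class named after
the one in this comment, not Derby's actual class.

```java
// Hypothetical sketch of nested recovery during a reopen. The exception class
// below is a local stand-in named after the one in this comment, not Derby's.
class NestedRecoverySketch {
    static class InterruptDetectedException extends Exception {}

    private int interruptsToInject;

    NestedRecoverySketch(int interruptsToInject) {
        this.interruptsToInject = interruptsToInject;
    }

    // Simulated openContainer: its embryonic page read may be interrupted, in
    // which case the nested recovery reopens the file, clears the interrupt
    // flag and tells the caller to retry.
    private void openContainer() throws InterruptDetectedException {
        if (interruptsToInject > 0) {
            interruptsToInject--;
            Thread.currentThread().interrupt();  // simulated interrupt during the read
            // ... nested recovery reopens the container ...
            Thread.interrupted();                // nested recovery clears the flag
            throw new InterruptDetectedException();
        }
    }

    // Outer recovery: reopen until openContainer completes without a nested recovery.
    int reopenContainer() {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                openContainer();
                return attempts;
            } catch (InterruptDetectedException e) {
                // Thread.currentThread().isInterrupted() is already false here,
                // so only the exception reveals that the open must be retried.
            }
        }
    }

    // Shows what a flag test right after a nested recovery would observe.
    static boolean flagSetAfterNestedRecovery() {
        NestedRecoverySketch s = new NestedRecoverySketch(1);
        try {
            s.openContainer();
        } catch (InterruptDetectedException expected) {
            // the nested recovery ran and cleared the flag
        }
        return Thread.currentThread().isInterrupted();
    }
}
```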

Also, when we give up reopening the container for some reason, we
forgot to decrement threadsInPageIO.
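
Here is a sketch of the give-up path. It assumes the reopen fails with
an IOException. Putting the decrement in a finally block means every
exit path restores the counter, including giving up. Names other than
threadsInPageIO are hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the counter must be decremented on every exit path,
// including the one where recovery gives up reopening the container.
class GiveUpSketch {
    static final AtomicInteger threadsInPageIO = new AtomicInteger();

    // Hypothetical reopen that always fails, e.g. because the file is gone.
    static void reopenContainer() throws IOException {
        throw new IOException("cannot reopen container");
    }

    // Forgetful version: the give-up path skips the decrement.
    static void recoverWithoutFinally() throws IOException {
        threadsInPageIO.incrementAndGet();
        reopenContainer();                  // throws, so the next line never runs
        threadsInPageIO.decrementAndGet();
    }

    // Balanced version: the decrement runs whether or not the reopen succeeds.
    static void recoverWithFinally() throws IOException {
        threadsInPageIO.incrementAndGet();
        try {
            reopenContainer();
        } finally {
            threadsInPageIO.decrementAndGet();
        }
    }

    static int counterAfterGivingUp(boolean useFinally) {
        threadsInPageIO.set(0);
        try {
            if (useFinally) {
                recoverWithFinally();
            } else {
                recoverWithoutFinally();
            }
        } catch (IOException expected) {
            // recovery gave up; the error would be reported to the caller
        }
        return threadsInPageIO.get();
    }
}
```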

To guard against other hangs, I will make the while (true) retry loops
give up after a maximum number of attempts in all cases. Before I
commit that change, though, it would be nice to see whether this patch
has any impact on DERBY-4920 (I suspect not). I'd like to hold off
because DERBY-4920 occurs during shutdown, so it may mask the fact that
recovery failed; cf. my comment
https://issues.apache.org/jira/browse/DERBY-4920?focusedCommentId=12936016&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12936016
So I'd rather wait with that until DERBY-4920 is out of the way.
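
A bounded replacement for a while (true) retry loop might look like the
sketch below. The helper name, the BooleanSupplier-based attempt and the
use of IOException to report giving up are assumptions made for
illustration. This comment does not say what limit Derby will use or how
the failure will be reported.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Hypothetical helper: a retry loop that gives up after maxAttempts instead
// of spinning forever in a while (true) loop.
class BoundedRetrySketch {
    // Runs attempt until it returns true and returns the number of attempts
    // used, or throws IOException once maxAttempts attempts have all failed.
    static int retryBounded(BooleanSupplier attempt, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int tries = 1; ; tries++) {
            if (attempt.getAsBoolean()) {
                return tries;
            }
            if (tries >= maxAttempts) {
                throw new IOException("giving up after " + tries + " attempts");
            }
        }
    }
}
```

In the recovery code, one attempt would correspond to one
reopen-and-retry cycle. A real implementation might also pause between
attempts, which this sketch omits.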

> Make Derby work reliably in the presence of thread interrupts
> -------------------------------------------------------------
>
>                 Key: DERBY-4741
>                 URL: https://issues.apache.org/jira/browse/DERBY-4741
>             Project: Derby
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 10.2.1.6, 10.2.2.0, 10.3.1.4, 10.3.2.1, 10.3.3.0, 10.4.1.3, 10.4.2.0, 10.5.1.1, 10.5.2.0, 10.5.3.0, 10.6.1.0
>            Reporter: Dag H. Wanvik
>            Assignee: Dag H. Wanvik
>         Attachments: derby-4741-a-01-api-interruptstatus.diff, derby-4741-a-01-api-interruptstatus.stat,
>                      derby-4741-a-02-api-interruptstatus.diff, derby-4741-a-02-api-interruptstatus.stat,
>                      derby-4741-a-03-api-interruptstatus.diff, derby-4741-a-03-api-interruptstatus.stat,
>                      derby-4741-a-04-api-interruptstatus.diff, derby-4741-a-04-api-interruptstatus.stat,
>                      derby-4741-all+lenient+resurrect.diff, derby-4741-all+lenient+resurrect.stat,
>                      derby-4741-b-01-nio.diff, derby-4741-b-01-nio.stat,
>                      derby-4741-b-02-nio.diff, derby-4741-b-02-nio.stat,
>                      derby-4741-b-03-nio.diff, derby-4741-b-03-nio.stat,
>                      derby-4741-b-04-nio.diff, derby-4741-b-04-nio.stat,
>                      derby-4741-c-01-nio.diff, derby-4741-c-01-nio.stat,
>                      derby-4741-nio-container+log+waits+locks+throws.diff, derby-4741-nio-container+log+waits+locks+throws.stat,
>                      derby-4741-nio-container+log+waits+locks-2.diff, derby-4741-nio-container+log+waits+locks-2.stat,
>                      derby-4741-nio-container+log+waits+locks.diff, derby-4741-nio-container+log+waits+locks.stat,
>                      derby-4741-nio-container+log+waits.diff, derby-4741-nio-container+log+waits.stat,
>                      derby-4741-nio-container+log.diff, derby-4741-nio-container+log.stat,
>                      derby-4741-nio-container-2.diff, derby-4741-nio-container-2.log, derby-4741-nio-container-2.stat,
>                      derby-4741-nio-container-2b.diff, derby-4741-nio-container-2b.stat,
>                      derby.log, derby.log, InterruptResilienceTest.java, MicroAPITest.java, xsbt0.log.gz
>
>
> When not executing on a small device VM, Derby has been using the Java NIO classes java.nio.channels.*
> for file I/O.
> If a thread is interrupted while executing blocking I/O operations in NIO, a ClosedByInterruptException
> is thrown and the channel is closed. Unfortunately, Derby isn't currently architected to retry and
> complete such operations (before passing on the interrupt), so the Derby database can be left in an
> inconsistent state and we therefore have to return a database-level error. This means that applications
> can no longer access the database without a shutdown and reboot, including recovery.
> It would be nice if Derby could detect thread interrupts that happen during I/O and finish the
> operations underway before passing the exception on to the application. Derby is sometimes
> embedded in applications that use Thread.interrupt to stop threads.
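
The NIO behaviour described above can be reproduced in a few lines of
plain Java, together with a retry-then-reinterrupt pattern. The
InterruptibleChannel contract says that a thread whose interrupt status
is already set, when it calls a blocking channel operation, has the
channel closed and receives a ClosedByInterruptException. The
readFully helper below is a hypothetical sketch, not Derby's
RAFContainer4 code. It clears the flag, reopens the file, finishes the
read, and then re-asserts the interrupt so the application still sees
it.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not Derby's RAFContainer4: finish a positional read
// even if an interrupt closes the channel, then re-assert the interrupt.
class InterruptibleReadSketch {
    // Fills buf from file offset pos. Returns the channel to keep using, which
    // is a freshly opened one if the original was closed by an interrupt.
    static FileChannel readFully(Path file, FileChannel ch, ByteBuffer buf,
                                 long pos, int maxRetries) throws IOException {
        final int start = buf.position();
        boolean sawInterrupt = false;
        int retries = 0;
        while (buf.hasRemaining()) {
            // Derive the file offset from the buffer, since an interrupted read
            // may still have transferred some bytes.
            long offset = pos + (buf.position() - start);
            try {
                if (ch.read(buf, offset) < 0) {
                    throw new EOFException("end of file at offset " + offset);
                }
            } catch (ClosedByInterruptException e) {
                sawInterrupt = true;
                Thread.interrupted();                    // clear the flag so the retry can proceed
                if (++retries > maxRetries) {
                    Thread.currentThread().interrupt();  // hand the interrupt back before failing
                    throw e;
                }
                ch = FileChannel.open(file, StandardOpenOption.READ);  // the old channel is closed
            }
        }
        if (sawInterrupt) {
            Thread.currentThread().interrupt();          // let the caller see the interrupt
        }
        return ch;
    }
}
```

Because the interrupt closes the original channel, callers must continue
with the returned channel. The maxRetries bound keeps a thread that is
interrupted repeatedly from looping forever.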

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

