db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dag H. Wanvik (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (DERBY-4741) Make Derby work reliably in the presence of thread interrupts
Date Thu, 11 Nov 2010 15:43:14 GMT

    [ https://issues.apache.org/jira/browse/DERBY-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929885#action_12929885
] 

Dag H. Wanvik edited comment on DERBY-4741 at 11/11/10 10:42 AM:
-----------------------------------------------------------------

Uploading patch derby-4741-b-01-nio. This patch contains changes to
FileContainer/RAFContainer/RAFContainer4 to allow page IO to recover
from interrupts seen in RAFContainer4 (the NIO[1] version
specialization of RAFContainer). Additionally, some waits are changed
to retry after interrupts (yes, there are more of those in other
classes, that's for a later patch, but since I was in there anyway..)

The main thrust of the patch is in RAFContainer4. Upon seeing
exceptions from IO (readPage/writePage), a thread will determine if it
is
        a) the thread that caused the channel to beclome closed, or
        b) it suffered an exceptions because another thread caused the
           channel to beclome closed.

If a), the thread will recover by reopening the container before
reattempting IO, cf. the method recoverContainerAfterInterrupt. If b)
the thread will yield, wait and retry until the container has been
recovered, or a minute has passed. After a minute, we give up and
throw FILE_IO_INTERRUPTED. I chose a minute somewhat arbitrarily, it
is hopefully sufficient (?).

The recovery thread and other waiting threads normally synchronize the
operations via variables protected by the channelCleanupMonitor
monitor (but see use of volatile and associated comments).

The retry logic happens in one of three places, in increasing
closeness to RAFContainer4:

    a) if the thread owns the monitor on allocCache and it is not the
       one that is doing recovery, it will back up to FileContainer
       before retrying the IO, since it has to release the lock on
       allocCache, because that monitor is needed by the thread doing
       recovery. Cf. changes in FileContainer.  If the thread holds
       allocCache, it goes into stealth mode in readPage/WritePage:
       this means that the recovery thread doesn't see it. We can't
       let a stealth mode thread grab channelCleanupMonitor, as this
       could cause a dead-lock. In c) the recovery thread keeps track
       of other threads, waking them up when recovery is done.

    b) if the thread owns the monitor on "this", again the thread goes
       into stealth mode. If it is not the one to clean up, it will
       back up to RAFContainer#clean's loop, since it has to release
       the lock on "this", because that monitor is also needed by the
       thread doing recovery. This only ever happens for readPage (not
       writePage), cf. call to getEmbryonicPage in WriteRAFHeader
       called from RAFContainer#clean.

    c) in other cases, the thread will call awaitRestoreChannel before
       reattempting IO.

The logic of RAFContainer4#getEmbryonicPage has been folded into
readPage in order for it to be covered by the interrupt treatement.

The test Derby151Test has been removed, since it is no longer
relevant. A new test, InterruptResilienceTest, fails without the
patch, but should work with the patch in place. It interrupts the
single app thread before doing IO, which will trip NIO and cause
recovery. Even though we have only one app thread, the test still
revealed concurrency issues during the development of this patch,
since RawStoreThread will also do IO on the container.

A note on the implementation: Sun Java NIO code (including the
latest 1.6) has a bug which are worked around by the patch in two
places, cf. code comments:

       - readFull/writeFull: if a thread has been interrupted,
         occasionally it closes the channel but does not throw,
         cf. http://bugs.sun.com/view_bug.do?bug_id=6979009
       - readPage/WritePage: sometimes the interrupted thread throws
         the wrong exception: AsynchronousCloseException instead of
         the more specialized ClosedByInterruptException.

[1] Note: DirRandomAccessFile4 also contain use of NIO, I'll address
that in a later patch.

I will also be adding more tests. A problem for making tests for
interrupts is to get coverage without making tests unstable, so I'll
wait until I have more holes plugged.

Regressions worked earlier tonight in a slightly different version, rerunning now.



      was (Author: dagw):
    Uploading patch derby-4741-b-01-nio. This patch contains changes to
FileContainer/RAFContainer/RAFContainer4 to allow page IO to recover
from interrupts seen in RAFContainer4 (the NIO[1] version
specialization of RAFContainer). Additionally, some waits are changed
to retry after interrupts (yes, there are more of those in other
classes, that's for a later patch, but since I was in there anyway..)

The main thrust of the patch is in RAFContainer4. Upon seeing
exceptions from IO (readPage/writePage), a thread will determine if it
is
        a) the thread that caused the channel to beclome closed, or
        b) it suffered an exceptions because another thread caused the
           channel to beclome closed.

If a), the thread will recover by reopening the container before
reattempting IO, cf. the method recoverContainerAfterInterrupt. If b)
the thread will yield, wait and retry until the container has been
recovered, or a minute has passed. After a minute, we give up and
throw FILE_IO_INTERRUPTED. I chose a minute somewhat arbitrarily, it
is hopefully sufficient (?).

The recovery thread and other waiting threads normally synchronize the
operations via variables protected by the channelCleanupMonitor
monitor (but see use of volatile and associated comments).

The retry logic happens in one of three places, in increasing
closeness to RAFContainer4:

    a) if the thread owns the monitor on allocCache and it is not the
       one that is doing recovery, it will back up to FileContainer
       before retrying the IO, since it has to release the lock on
       allocCache, because that monitor is needed by the thread doing
       recovery. Cf. changes in FileContainer.  If the thread holds
       allocCache, it goes into stealth mode in readPage/WritePage:
       this means that the recovery thread doesn't see it. We can't
       let a stealth mode thread grab channelCleanupMonitor, as this
       could cause a dead-lock. In c) the recovery thread keeps track
       of other threads, waking them up when recovery is done.

    b) if the thread owns the monitor on "this", again the thread goes
       into stealth mode. If it is not the one to clean up, it will
       back up to RAFContainer#clean's loop, since it has to release
       the lock on "this", because that monitor is also needed by the
       thread doing recovery. This only ever happens for readPage (not
       writePage), cf. call to getEmbryonicPage in WriteRAFHeader
       called from RAFContainer#clean.

    c) in other cases, the thread will call awaitRestoreChannel before
       reattempting IO.

The logic of RAFContainer4#getEmbryonicPage has been folded into
readPage in order for it to be covered by the interrupt treatement.

The test Derby151Test has been removed, since it is no longer
relevant. A new test, InterruptResilienceTest, fails without the
patch, but should work with the patch in place. It interrupts the
single app thread before doing IO, which will trip NIO and cause
recovery. Even though we have only one app thread, the test still
revealed concurrency issues during the development of this patch,
since RawStoreThread will also do IO on the container.

A note on the implementation: Sun Java NIO code (including the
latest 1.6) has a bug which are worked around by the patch in two
places, cf. code comments:

       - readFull/writeFull: if a thread has been interrupted,
         occasionally it closes the channel but does not throw,
         cf. http://stackoverflow.com/questions/3541411/closedbyinterruptexception-not-thrown
       - readPage/WritePage: sometimes the interrupted thread throws
         the wrong exception: AsynchronousCloseException instead of
         the more specialized ClosedByInterruptException.

[1] Note: DirRandomAccessFile4 also contain use of NIO, I'll address
that in a later patch.

I will also be adding more tests. A problem for making tests for
interrupts is to get coverage without making tests unstable, so I'll
wait until I have more holes plugged.

Regressions worked earlier tonight in a slightly different version, rerunning now.


  
> Make Derby work reliably in the presence of thread interrupts
> -------------------------------------------------------------
>
>                 Key: DERBY-4741
>                 URL: https://issues.apache.org/jira/browse/DERBY-4741
>             Project: Derby
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 10.2.1.6, 10.2.2.0, 10.3.1.4, 10.3.2.1, 10.3.3.0, 10.4.1.3, 10.4.2.0,
10.5.1.1, 10.5.2.0, 10.5.3.0, 10.6.1.0
>            Reporter: Dag H. Wanvik
>            Assignee: Dag H. Wanvik
>         Attachments: derby-4741-a-01-api-interruptstatus.diff, derby-4741-a-01-api-interruptstatus.stat,
derby-4741-a-02-api-interruptstatus.diff, derby-4741-a-02-api-interruptstatus.stat, derby-4741-a-03-api-interruptstatus.diff,
derby-4741-a-03-api-interruptstatus.stat, derby-4741-a-04-api-interruptstatus.diff, derby-4741-a-04-api-interruptstatus.stat,
derby-4741-all+lenient+resurrect.diff, derby-4741-all+lenient+resurrect.stat, derby-4741-b-01-nio.diff,
derby-4741-b-01-nio.stat, derby-4741-nio-container+log+waits+locks+throws.diff, derby-4741-nio-container+log+waits+locks+throws.stat,
derby-4741-nio-container+log+waits+locks-2.diff, derby-4741-nio-container+log+waits+locks-2.stat,
derby-4741-nio-container+log+waits+locks.diff, derby-4741-nio-container+log+waits+locks.stat,
derby-4741-nio-container+log+waits.diff, derby-4741-nio-container+log+waits.stat, derby-4741-nio-container+log.diff,
derby-4741-nio-container+log.stat, derby-4741-nio-container-2.diff, derby-4741-nio-container-2.log,
derby-4741-nio-container-2.stat, derby-4741-nio-container-2b.diff, derby-4741-nio-container-2b.stat,
derby.log, derby.log, InterruptResilienceTest.java, MicroAPITest.java, xsbt0.log.gz
>
>
> When not executing on a small device VM, Derby has been using the Java NIO classes java.nio.clannel.*
for file io.
> If thread is interrupted while executing blocking IO operations in NIO, the ClosedByInterruptException
will get thrown. Unfortunately, Derby isn't current architected to retry and complete such
operations (before passing on the interrupt), so the Derby database can be left in an inconsistent
state and we therefore have to return a database level error. This means the applications
can no longer access the database without a shutdown and reboot including a recovery.
> It would be nice if Derby could somehow detect and finish IO operations underway when
thread interrupts happen before passing the exception on to the application. Derby embedded
is sometimes embedded in applications that use Thread.interrupt to stop threads.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message