db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dag H. Wanvik (JIRA)" <j...@apache.org>
Subject [jira] Updated: (DERBY-4920) suites.All stuck in RAFContainer4.awaitRestoreChannel()
Date Mon, 13 Dec 2010 18:32:10 GMT

     [ https://issues.apache.org/jira/browse/DERBY-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Dag H. Wanvik updated DERBY-4920:
---------------------------------

    Attachment: derby-4943-a.stat
                derby-4920-a.diff

I was finally able to track this one down by instrumenting RAFContainer4 to collect a bit
of history, cf. uploaded patch "RAFContainer4-instrumentation". (Thanks to Knut for noticing
it happens more often when running on an in-memory file system e.g. /tmp on Solaris). Turned
out the timing was slightly different there: instead of happening very infrequently, it happened
more often than not when running the test that we saw hang: LobSortTest.

(Aside: trying to reproduce on Solaris made it even harder since the interrupt would sometimes
get swallowed by non-NIO IO calls: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6871306)

In any case, this is what happens: The test shuts down the database to modify static properties
("SystemPropertyTestSetup") after having created the table MIXED_LOBS in CleanDatabaseTestSetup#decorateSQL.
The raw store demon thread is performing a check-point at the time the user thread shuts down
Derby.  The shutdown interrupts the demon while it is doing NIO. The demon tries to recover
the channel. This fails due a ShutdownException thrown when InterruptStatus.setInterrupted
tries to find the lcc by calling ContextService#getContextOrNull. (RAFContainer4#recoverContainerAfterInterrupt
calls noteAndClearInterrupt. The latter in turn calls setInterrupted.)

Later during shutdown, the user thread detects that there are still dirty pages and tries
(again) to perform the check-point, but now the channel is closed, since the demon's recovery
failed. This leads to the hang because the user thread keeps waiting for recovery to happen.

The fix is simple, just make ContextService#getContextOrNull ignore ShutdownException (as
we did for DERBY-4911).

Note that this conforms my hypothesis that the check-pointing at shutdown has been faulty
since we introduced NIO containers in 10.2: The interrupted state of the channel was masked
by the shutdown.  It would only ever be an issue if there was a background thread doing check-pointing
at shutdown time, however. The next reboot would read the logs and recover, of course, so
that's probably why we never have had a report on this.


Patch details:

* InterruptStatus.java

Make sure we always catch ShutdownException when we access the context looking for lcc.

* RAFContainer4.java

Improved some comments, added a final, added a max count for a retry loop that missed it:
will throw FILE_IO_INTERRUPTED if it exceeds MAX_INTERRUPT_RETRIES (as elsewhere when we give
up).

Running regressions again.


> suites.All stuck in RAFContainer4.awaitRestoreChannel()
> -------------------------------------------------------
>
>                 Key: DERBY-4920
>                 URL: https://issues.apache.org/jira/browse/DERBY-4920
>             Project: Derby
>          Issue Type: Bug
>          Components: Store
>    Affects Versions: 10.8.0.0
>            Reporter: Knut Anders Hatlen
>            Assignee: Dag H. Wanvik
>         Attachments: derby-4920-a.diff, DERBY-4920_db2compatibility.txt, derby-4943-a.stat,
jstack.19071.txt
>
>
> I have suites.All running on trunk, and it seems to be stuck in a call to RAFContainer4.awaitRestoreChannel().
Here's the stack for the waiting thread:
> "main" prio=3 tid=0x08070800 nid=0x2 waiting on condition [0xfe61e000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.derby.impl.store.raw.data.RAFContainer4.awaitRestoreChannel(RAFContainer4.java:698)
>         at org.apache.derby.impl.store.raw.data.RAFContainer4.writePage(RAFContainer4.java:593)
>         at org.apache.derby.impl.store.raw.data.CachedPage.writePage(CachedPage.java:787)
>         at org.apache.derby.impl.store.raw.data.CachedPage.clean(CachedPage.java:610)
>         at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAndUnkeepEntry(ConcurrentCache.java:551)
>         at org.apache.derby.impl.services.cache.ConcurrentCache.cleanCache(ConcurrentCache.java:509)
>         at org.apache.derby.impl.services.cache.ConcurrentCache.cleanAll(ConcurrentCache.java:460)
>         at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.checkpoint(BaseDataFileFactory.java:1211)
>         at org.apache.derby.impl.store.raw.log.LogToFile.checkpointWithTran(LogToFile.java:1710)
>         at org.apache.derby.impl.store.raw.log.LogToFile.checkpoint(LogToFile.java:1507)
>         at org.apache.derby.impl.store.raw.RawStore.stop(RawStore.java:368)
>         at org.apache.derby.impl.services.monitor.TopService.stop(TopService.java:442)
>         at org.apache.derby.impl.services.monitor.TopService.shutdown(TopService.java:393)
>         at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:229)
>         at org.apache.derby.impl.services.monitor.BaseMonitor.shutdown(BaseMonitor.java:199)
>         at org.apache.derby.jdbc.InternalDriver.connect(InternalDriver.java:231)
>         at org.apache.derby.jdbc.AutoloadedDriver.connect(AutoloadedDriver.java:119)
>         at java.sql.DriverManager.getConnection(DriverManager.java:582)
>         at java.sql.DriverManager.getConnection(DriverManager.java:154)
>         at org.apache.derbyTesting.junit.DriverManagerConnector.getConnectionByAttributes(DriverManagerConnector.java:137)
>         at org.apache.derbyTesting.junit.DriverManagerConnector.shutEngine(DriverManagerConnector.java:120)
>         at org.apache.derbyTesting.junit.TestConfiguration.shutdownEngine(TestConfiguration.java:1599)
>         at org.apache.derbyTesting.junit.SystemPropertyTestSetup.setUp(SystemPropertyTestSetup.java:83)
>         at junit.extensions.TestSetup$1.protect(TestSetup.java:20)
>         at junit.framework.TestResult.runProtected(TestResult.java:124)
>         at junit.extensions.TestSetup.run(TestSetup.java:25)
>         at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
>         at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
>         at junit.framework.TestResult.runProtected(TestResult.java:124)
>         at junit.extensions.TestSetup.run(TestSetup.java:25)
>         at org.apache.derbyTesting.junit.BaseTestSetup.run(BaseTestSetup.java:57)
>         at junit.framework.TestSuite.runTest(TestSuite.java:230)
>         at junit.framework.TestSuite.run(TestSuite.java:225)
>         at junit.framework.TestSuite.runTest(TestSuite.java:230)
>         at junit.framework.TestSuite.run(TestSuite.java:225)
>         at junit.framework.TestSuite.runTest(TestSuite.java:230)
>         at junit.framework.TestSuite.run(TestSuite.java:225)
>         at junit.textui.TestRunner.doRun(TestRunner.java:121)
>         at junit.textui.TestRunner.start(TestRunner.java:185)
>         at junit.textui.TestRunner.main(TestRunner.java:143)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message