db-derby-dev mailing list archives

From "Brett Bergquist (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DERBY-6879) Engine deadlock between XA timeout handling and cleanupOnError
Date Tue, 21 Jun 2016 21:19:58 GMT

    [ https://issues.apache.org/jira/browse/DERBY-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342733#comment-15342733 ]

Brett Bergquist commented on DERBY-6879:

In XATransactionState.java, we have:

        /** Runs the cancel task of the global transaction. */
        public void run() {
            try {
                // ... (body elided in this excerpt; per the thread dump it invokes xaState.cancel(...))
            } catch (Throwable th) {

and the "cancel" method looks like:

    synchronized void cancel(String messageId) throws XAException {
            try {
                // ... (body elided in this excerpt; it ends up calling xa_rollback() on the connection)
            } catch (SQLException sqle) {

and the "xa_rollback" method looks like:

	public final void xa_rollback() throws SQLException {
		synchronized (getConnectionSynchronization()) {
			// ... (body elided in this excerpt)

So when the timer expires, the lock pattern is to lock the XATransactionState and then the
EmbedConnection object.   

When an exception such as a lock timeout occurs while using an XA transaction and the cleanupOnError
code is called, the call stack looks like:

"DRDAConnThread_3" #15 prio=5 os_prio=64 tid=0x0000000103d5b000 nid=0x19 waiting for monitor
entry [0xffffffff1a9fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.derby.jdbc.XATransactionState.cleanupOnError(XATransactionState.java:155)
	- waiting to lock <0xffffffff4d0aead8> (a org.apache.derby.jdbc.XATransactionState)
	at org.apache.derby.iapi.services.context.ContextManager.cleanupOnError(ContextManager.java:343)
	at org.apache.derby.impl.jdbc.TransactionResourceImpl.cleanupOnError(TransactionResourceImpl.java:461)
	at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:344)
	at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2400)
	at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:85)
	at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1437)
	- locked <0xffffffff4d01b698> (a org.apache.derby.impl.jdbc.EmbedConnection40)
	at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:711)
	- locked <0xffffffff4d01b698> (a org.apache.derby.impl.jdbc.EmbedConnection40)
	at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeUpdate(EmbedStatement.java:190)
	at org.apache.derby.iapi.jdbc.BrokeredStatement.executeLargeUpdate(BrokeredStatement.java:633)
	at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLIMM(DRDAConnThread.java:5393)
	at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:759)
	at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:295)

So here the lock pattern is to lock the EmbedConnection40 (which is an EmbedConnection) first and
then the XATransactionState.

The two code paths acquire the same two monitors in the opposite order, so a deadlock occurs.
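The opposite acquisition orders can be reduced to a minimal, self-contained sketch. This is not Derby code: `stateLock` and `connLock` below are plain stand-ins for the XATransactionState and EmbedConnection monitors, and the deadlock is confirmed via the JDK's own ThreadMXBean:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class LockOrderDeadlockDemo {
    // Stand-ins for the two monitors involved in DERBY-6879.
    static final Object stateLock = new Object(); // plays the XATransactionState
    static final Object connLock  = new Object(); // plays the EmbedConnection

    public static void main(String[] args) throws Exception {
        // "Timer-0" path: transaction state first, then the connection.
        Thread timer = lockInOrder(stateLock, connLock);
        // "DRDAConnThread" path: connection first, then the transaction state.
        Thread drda  = lockInOrder(connLock, stateLock);
        timer.start();
        drda.start();

        Thread.sleep(1000); // give both threads time to block on each other
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = mx.findDeadlockedThreads();
        System.out.println(deadlocked != null && deadlocked.length >= 2
                ? "deadlock detected" : "no deadlock");
    }

    static Thread lockInOrder(Object first, Object second) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                sleep(300); // ensure the other thread holds its first lock
                synchronized (second) { /* never reached */ }
            }
        });
        t.setDaemon(true); // let the JVM exit despite the deadlocked threads
        return t;
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}
```

Running it prints "deadlock detected": each thread holds its first monitor and blocks forever on the other's.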

I have been able to reproduce this in a test application.  Using a normal non-XA connection,
I query and lock many (10000) rows in a large table.  Using a separate XA connection, I try
to delete those same rows.  I have "derby.jdbc.xaTransactionTimeout=30" in my derby.properties
to force a very short transaction timeout, and I also have "derby.locks.waitTimeout=60" in my derby.properties.
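For reference, the two derby.properties entries used in the reproduction:

```properties
# XA transactions time out after 30 seconds
derby.jdbc.xaTransactionTimeout=30
# lock requests wait up to 60 seconds before timing out
derby.locks.waitTimeout=60
```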

When I run the test application, it locks the 10000 rows in one uncommitted transaction through
the first non-XA connection.  The second XA connection then tries to delete these rows.  Since
"derby.jdbc.xaTransactionTimeout" is 30 seconds, the XA transaction will time out at 30 seconds,
but "derby.locks.waitTimeout" will cause the XA transaction to wait up to 60 seconds to obtain
the locks that it needs.

The timeout of the XA transaction causes a SQLTransactionRollbackException to be recognized on
the XA connection, which then tries to call the cleanup code.  But the EmbedConnection is already
locked, and the cleanup code will try to lock the XATransactionState, while the XATransactionState
is locked by the timeout task, which in turn is attempting to call "xa_rollback" on the XA
connection.  Deadlock.

I have a simple workaround, which is to change

        /** Runs the cancel task of the global transaction. */
        public void run() {
            try {
                // ... (body elided in this excerpt)
            } catch (Throwable th) {

to:

        public void run() {
            try {
                synchronized (xaState.conn) {
                    // ... (existing cancel logic, elided in this excerpt)
                }
            } catch (Throwable th) {

This changes the timeout task to first lock the connection and then the XATransactionState.
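A minimal illustration of why imposing one global order removes the deadlock. Again, not Derby code: `connLock` and `stateLock` are stand-ins for the two monitors, and both "threads" now take the connection monitor first, as the workaround does:

```java
public class ConsistentLockOrderDemo {
    static final Object connLock  = new Object(); // plays the EmbedConnection
    static final Object stateLock = new Object(); // plays the XATransactionState

    // Both code paths now acquire the connection monitor first.
    static void doWork() {
        synchronized (connLock) {
            synchronized (stateLock) {
                // rollback / cleanup work would happen here
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Thread timer = new Thread(ConsistentLockOrderDemo::doWork);
        Thread drda  = new Thread(ConsistentLockOrderDemo::doWork);
        timer.start();
        drda.start();
        timer.join(2000);
        drda.join(2000);
        System.out.println(!timer.isAlive() && !drda.isAlive()
                ? "both threads finished" : "threads stuck");
    }
}
```

With a single acquisition order there is no cycle, so both threads run to completion and the program prints "both threads finished".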

I would actually prefer to somehow add a new method to "EmbedConnection" that would acquire
its lock and then call back into XATransactionState.cancel to perform the cancel, which would
also fix the lock ordering, but I am struggling to find a way to get from an EmbedConnection
back to the XATransactionState.
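For what it's worth, a hypothetical shape of that preferred approach might look like the following. Every name here is invented for illustration; EmbedConnection exposes no such method today, and the open question of how the connection finds its XATransactionState is left as the callback parameter:

```java
public class ConnectionFirstCancelSketch {

    // Hypothetical callback the XA state would implement (name invented).
    interface XACancelCallback {
        void cancel(String messageId) throws Exception;
    }

    // Stand-in for EmbedConnection (not Derby code).
    static class ConnectionLike {
        private final Object connectionSync = new Object();

        // The connection owns the outer monitor and calls back into the XA
        // state, so every path shares the connection-first lock order.
        void cancelWithConnectionLock(XACancelCallback cb, String messageId)
                throws Exception {
            synchronized (connectionSync) {
                cb.cancel(messageId);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ConnectionLike conn = new ConnectionLike();
        conn.cancelWithConnectionLock(
                id -> System.out.println("cancelled " + id), "XID-1");
        // prints "cancelled XID-1"
    }
}
```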

> Engine deadlock between XA timeout handling and cleanupOnError
> --------------------------------------------------------------
>                 Key: DERBY-6879
>                 URL: https://issues.apache.org/jira/browse/DERBY-6879
>             Project: Derby
>          Issue Type: Bug
>          Components: Services
>    Affects Versions:
>         Environment: Solaris 10.5 on Oracle M5000 
>            Reporter: Brett Bergquist
> Deadlock between XA timer cleanup task and the ContextManager.cleanupOnError
> Found one Java-level deadlock:
> =============================
> "DRDAConnThread_34":
>   waiting to lock monitor 0x0000000104b14d18 (object 0xfffffffd9090f058, a org.apache.derby.jdbc.XATransactionState),
>   which is held by "Timer-0"
> "Timer-0":
>   waiting to lock monitor 0x00000001038b96e8 (object 0xfffffffd9090d8b0, a org.apache.derby.impl.jdbc.EmbedConnection40),
>   which is held by "DRDAConnThread_34"
> Java stack information for the threads listed above:
> ===================================================
> "DRDAConnThread_34":
>      at org.apache.derby.jdbc.XATransactionState.cleanupOnError(Unknown Source)
>      - waiting to lock <0xfffffffd9090f058> (a org.apache.derby.jdbc.XATransactionState)
>      at org.apache.derby.iapi.services.context.ContextManager.cleanupOnError(Unknown Source)
>      at org.apache.derby.impl.jdbc.TransactionResourceImpl.cleanupOnError(Unknown Source)
>      at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
>      at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
>      at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
>      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
>      - locked <0xfffffffd9090d8b0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
>      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown Source)
>      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(Unknown Source)
>      at org.apache.derby.iapi.jdbc.BrokeredPreparedStatement.execute(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAStatement.execute(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTTobjects(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTT(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
> "Timer-0":
>      at org.apache.derby.impl.jdbc.EmbedConnection.xa_rollback(Unknown Source)
>      - waiting to lock <0xfffffffd9090d8b0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
>      at org.apache.derby.jdbc.XATransactionState.cancel(Unknown Source)
>      - locked <0xfffffffd9090f058> (a org.apache.derby.jdbc.XATransactionState)
>      at org.apache.derby.jdbc.XATransactionState$CancelXATransactionTask.run(Unknown Source)
>      at java.util.TimerThread.mainLoop(Timer.java:555)
>      at java.util.TimerThread.run(Timer.java:505)
> Found 1 deadlock.
> This deadlock caused Derby to create 18000 transaction recovery logs because of the XA
> transaction that did not clean up in the timeout.  Rebooting the system would cause a 50 hour
> boot-up time to process the transaction logs, so recovery had to be done by going to a backup
> database taken before the issue occurred.

This message was sent by Atlassian JIRA
