[ https://issues.apache.org/jira/browse/DERBY-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342733#comment-15342733 ] Brett Bergquist commented on DERBY-6879: ---------------------------------------- In XATransactionState.java, we have: /** * Runs the cancel task of the global transaction */ public void run() { try { xaState.cancel(MessageId.CONN_XA_TRANSACTION_TIMED_OUT); } catch (Throwable th) { Monitor.logThrowable(th); } } and the "cancel" method looks like: synchronized void cancel(String messageId) throws XAException { ... try { conn.xa_rollback(); } catch (SQLException sqle) { and the "xa_rollback" method looks like: public final void xa_rollback() throws SQLException { synchronized (getConnectionSynchronization()) { So when the timer expires, the lock pattern is to lock the XATransactionState and then the EmbedConnection object. When an exception such as a lock timeout occurs when using a XA transaction and cleanupOnError code is called, the call stack looks like: "DRDAConnThread_3" #15 prio=5 os_prio=64 tid=0x0000000103d5b000 nid=0x19 waiting for monitor entry [0xffffffff1a9fe000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.derby.jdbc.XATransactionState.cleanupOnError(XATransactionState.java:155) - waiting to lock <0xffffffff4d0aead8> (a org.apache.derby.jdbc.XATransactionState) at org.apache.derby.iapi.services.context.ContextManager.cleanupOnError(ContextManager.java:343) at org.apache.derby.impl.jdbc.TransactionResourceImpl.cleanupOnError(TransactionResourceImpl.java:461) at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(TransactionResourceImpl.java:344) at org.apache.derby.impl.jdbc.EmbedConnection.handleException(EmbedConnection.java:2400) at org.apache.derby.impl.jdbc.ConnectionChild.handleException(ConnectionChild.java:85) at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(EmbedStatement.java:1437) - locked <0xffffffff4d01b698> (a org.apache.derby.impl.jdbc.EmbedConnection40) at org.apache.derby.impl.jdbc.EmbedStatement.execute(EmbedStatement.java:711) - locked <0xffffffff4d01b698> (a org.apache.derby.impl.jdbc.EmbedConnection40) at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeUpdate(EmbedStatement.java:190) at org.apache.derby.iapi.jdbc.BrokeredStatement.executeLargeUpdate(BrokeredStatement.java:633) at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLIMM(DRDAConnThread.java:5393) at org.apache.derby.impl.drda.DRDAConnThread.processCommands(DRDAConnThread.java:759) at org.apache.derby.impl.drda.DRDAConnThread.run(DRDAConnThread.java:295) So here the lock pattern is locking the EmbedConnection40 (which is an EmbedConnection) and then the XATransactionState. So we have just the opposite pattern and a deadlock occurs. I have been able to get this to reproduce in a test application. Using a normal non-XA connection, I query and lock many (10000) rows in a large table. Using a separate XA connection, I try to delete those same rows. I have "derby.jdbc.xaTransactionTimeout=30" in my derby.properties to have a very short transaction timeout. I have the "derby.locks.waitTimeout=60" in my derby.properties. When I run the test application, it locks up the 10000 rows in one transaction that is not committed through the first non-XA connection. The second XA connection tries to delete these rows. Since the "derby.jdbc.xaTransactionTimeout" is 30 seconds, the XA transaction is going to timeout at 30 seconds, but the "derby.locks.waitTimeout" is going to cause the XA transaction to try wait for 60 seconds to obtain the locks that it needs. The timeout of the XA transaction is going to cause a SQLTransactionRollbackException to be recognized on the XA connection and it will try to then call the cleanup code. But alas, the EmbedConnection is locked and it will try to lock the XATransactionState in the cleanup code and the XATransactionState is lock in the timeout task and will attempt to call "xa_rollback" on the XA connection. Deadlock. I have a simple workaround and that is to change /** * Runs the cancel task of the global transaction */ public void run() { try { xaState.cancel(MessageId.CONN_XA_TRANSACTION_TIMED_OUT); } catch (Throwable th) { Monitor.logThrowable(th); } } to public void run() { try { synchronized (xaState.conn) { xaState.cancel(MessageId.CONN_XA_TRANSACTION_TIMED_OUT); } } catch (Throwable th) { Monitor.logThrowable(th); } } This changes the timeout task to first lock on the connection and then XATransactionState. I would actually prefer to somehow add a new method to "EmbedConnection" that would acquire its lock which would then call back into XATransactionState.cancel to perform the cancel which would also change the lock pattern, but I am struggling to find out how to get from an EmbedConnection back to the XATransactionState. > Engine deadlock between XA timeout handling and cleanupOnError > -------------------------------------------------------------- > > Key: DERBY-6879 > URL: https://issues.apache.org/jira/browse/DERBY-6879 > Project: Derby > Issue Type: Bug > Components: Services > Affects Versions: 10.10.2.0 > Environment: Solaris 10.5 on Oracle M5000 > Reporter: Brett Bergquist > > Deadlock between XA timer cleanup task and the ContextManager.cleanupOnError > Found one Java-level deadlock: > ============================= > "DRDAConnThread_34": > waiting to lock monitor 0x0000000104b14d18 (object 0xfffffffd9090f058, a org.apache.derby.jdbc.XATransactionState), > which is held by "Timer-0" > "Timer-0": > waiting to lock monitor 0x00000001038b96e8 (object 0xfffffffd9090d8b0, a org.apache.derby.impl.jdbc.EmbedConnection40), > which is held by "DRDAConnThread_34" > > Java stack information for the threads listed above: > =================================================== > "DRDAConnThread_34": > at org.apache.derby.jdbc.XATransactionState.cleanupOnError(Unknown Source) > - waiting to lock <0xfffffffd9090f058> (a org.apache.derby.jdbc.XATransactionState) > at org.apache.derby.iapi.services.context.ContextManager.cleanupOnError(Unknown Source) > at org.apache.derby.impl.jdbc.TransactionResourceImpl.cleanupOnError(Unknown Source) > at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source) > - locked <0xfffffffd9090d8b0> (a org.apache.derby.impl.jdbc.EmbedConnection40) > at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(Unknown Source) > at org.apache.derby.iapi.jdbc.BrokeredPreparedStatement.execute(Unknown Source) > at org.apache.derby.impl.drda.DRDAStatement.execute(Unknown Source) > at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTTobjects(Unknown Source) > at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTT(Unknown Source) > at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source) > at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source) > "Timer-0": > at org.apache.derby.impl.jdbc.EmbedConnection.xa_rollback(Unknown Source) > - waiting to lock <0xfffffffd9090d8b0> (a org.apache.derby.impl.jdbc.EmbedConnection40) > at org.apache.derby.jdbc.XATransactionState.cancel(Unknown Source) > - locked <0xfffffffd9090f058> (a org.apache.derby.jdbc.XATransactionState) > at org.apache.derby.jdbc.XATransactionState$CancelXATransactionTask.run(Unknown Source) > at java.util.TimerThread.mainLoop(Timer.java:555) > at java.util.TimerThread.run(Timer.java:505) > > Found 1 deadlock. > This deadlock caused Derby to create 18000 transaction recovery logs because of the XA transaction that did not cleanup in the timeout. Rebooting the system would cause a 50 hour boot up time to process the transaction logs so recovery had to be done by going to a backup database before the issue occurred. -- This message was sent by Atlassian JIRA (v6.3.4#6332)