db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kristian Waagan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DERBY-3639) Slave on Windows stops replication when network connection is broken, failover fails
Date Tue, 07 Jun 2011 08:14:59 GMT

     [ https://issues.apache.org/jira/browse/DERBY-3639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Kristian Waagan updated DERBY-3639:
-----------------------------------

    Attachment: derby-3639-1a-handle_connection_reset.diff

Attaching patch 1a, which makes the slave controller interpret a SocketException with a message
starting with "Connection reset" as a disconnect. When running the full regression suite I
got one error in the replication tests, but running the suite manually afterwards resulted
in no errors:
testReplication_Local_StateTest_part1_2(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2)
ERROR:
java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, SQLERRMC: org.apache.derby.shared.common.sanity.AssertFailure^TASSERT
FAILED connection is closed^TXJ001.U
        at org.apache.derby.client.am.SQLExceptionFactory40.getSQLException(SQLExceptionFactory40.java:99)
        at org.apache.derby.client.am.SqlException.getSQLException(SqlException.java:358)
        at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:166)
        at java.sql.DriverManager.getConnection(DriverManager.java:582)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2._testPostStartedMasterAndSlave_StopMaster(ReplicationRun_Local_StateTest_part1_2.java:226)
        at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_2.testReplication_Local_StateTest_part1_2(ReplicationRun_Local_StateTest_part1_2.java:137)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at junit.framework.TestCase.runTest(TestCase.java:164)
        at junit.framework.TestCase.runBare(TestCase.java:130)
        at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:112)
        at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun.runBare(ReplicationRun.java:207)
        at junit.framework.TestResult$1.protect(TestResult.java:106)
        at junit.framework.TestResult.runProtected(TestResult.java:124)
        at junit.framework.TestResult.run(TestResult.java:109)
        at junit.framework.TestCase.run(TestCase.java:120)
        at junit.framework.TestSuite.runTest(TestSuite.java:230)
        at junit.framework.TestSuite.run(TestSuite.java:225)
        at junit.framework.TestSuite.runTest(TestSuite.java:230)
        at junit.framework.TestSuite.run(TestSuite.java:225)
        at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
        at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
        at junit.framework.TestResult.runProtected(TestResult.java:124)
        at junit.extensions.TestSetup.run(TestSetup.java:25)
        at junit.framework.TestSuite.runTest(TestSuite.java:230)
        at junit.framework.TestSuite.run(TestSuite.java:225)
        at kah.TestRunner.main(TestRunner.java:77)
Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: -1, SQLSTATE:
XJ001, SQLERRMC: org.apache.derby.shared.common.sanity.AssertFailure^TASSERT FAILED connection
is closed^TXJ001.U
        at org.apache.derby.client.am.Connection.completeSqlca(Connection.java:2125)
        at org.apache.derby.client.net.NetConnectionReply.parseRdbAccessFailed(NetConnectionReply.java:538)
        at org.apache.derby.client.net.NetConnectionReply.parseAccessRdbError(NetConnectionReply.java:431)
        at org.apache.derby.client.net.NetConnectionReply.parseACCRDBreply(NetConnectionReply.java:294)
        at org.apache.derby.client.net.NetConnectionReply.readAccessDatabase(NetConnectionReply.java:121)
        at org.apache.derby.client.net.NetConnection.readSecurityCheckAndAccessRdb(NetConnection.java:848)
        at org.apache.derby.client.net.NetConnection.flowSecurityCheckAndAccessRdb(NetConnection.java:771)
        at org.apache.derby.client.net.NetConnection.flowUSRIDONLconnect(NetConnection.java:603)
        at org.apache.derby.client.net.NetConnection.flowConnect(NetConnection.java:410)
        at org.apache.derby.client.net.NetConnection.<init>(NetConnection.java:220)
        at org.apache.derby.client.net.NetConnection40.<init>(NetConnection40.java:74)
        at org.apache.derby.client.net.ClientJDBCObjectFactoryImpl40.newNetConnection(ClientJDBCObjectFactoryImpl40.java:269)
        at org.apache.derby.jdbc.ClientDriver.connect(ClientDriver.java:157)
        ... 27 more


I'll run the full suite again to make sure this isn't caused by the patch. I have also manually
tested the patch on Windows (with Solaris 11 Express as the master).

Patch ready for review.


Brett,
I don't know the exact details here, but cursory searches indicate that this is caused by
the network stack behaving differently on Windows. Most of the messages were regarding closing
the connection during normal operation, where RST was sent instead of FIN, but there seems
to be a similar issue when the cable is unplugged.

A test on my UNIX laptop suggested that it too was able to detect an unplugged cable instantly,
but I don't know what actually happened on the protocol level.

> Slave on Windows stops replication when network connection is broken, failover fails
> ------------------------------------------------------------------------------------
>
>                 Key: DERBY-3639
>                 URL: https://issues.apache.org/jira/browse/DERBY-3639
>             Project: Derby
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 10.4.1.3
>         Environment: Master on Solaris 10 x86, Slave on Windows XP SP2.
> Slave VM: Sun's Java HotSpot Client VM build 1.6.0_03-b05.
>            Reporter: John H. Embretsen
>            Assignee: Kristian Waagan
>            Priority: Minor
>              Labels: derby_triage10_5_2
>         Attachments: derby-3639-1a-handle_connection_reset.diff, derby.log
>
>
> Replication: Failover on slave fails after network connection is broken (network cable
to slave pulled out); database is shut down due to an "unexpected error". 
> Same experiment with a replicated embedded database on Linux (FC5) and Windows resulted
in success on Linux and failure on Windows.
> Documentation (admin guide, "Replication failure handling") says: 
> "Slave loses connection with master: The slave tries to reestablish the connection with
the master by listening on the specified host and port. It will not give up until it is explicitly
requested to do so by either the failover=true or stopSlave=true connection URL attribute.
If a failover is requested, the slave applies all received log records and boots the database
as described in Forcing a failover."
> Slave console:
> java -jar lib\derbyrun.jar ij
> ij version 10.4
> ij> connect 'jdbc:derby:replicDB;startSlave=true;slaveHost=0.0.0.0';
> ERROR XRE08: Replication slave mode started successfully for database 'replicDB'. Connection
refused because the database is in replication slave mode
> .
> ij> connect 'jdbc:derby:replicDB';
> ERROR 08004: Connection refused to database 'replicDB' because it is in replication slave
mode.
> ij> -- network cable unplugged from slave
> ij> connect 'jdbc:derby:replicDB;failover=true';
> ERROR XRE11: Could not perform operation 'failover' because the database 'replicDB' has
not been booted.
> ij> connect 'jdbc:derby:replicDB';
> ij>
> The slave's derby.log reported the following after the network cable was pulled and the
failover command was issued:
> ----  BEGIN REPLICATION ERROR MESSAGE (4/22/08 2:01 PM) ----
> Replication slave got a fatal error for database 'replicDB'. Replication will be stopped.
> ERROR XRE03: Unexpected replication error. See derby.log for details.
> 	at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
> 	at org.apache.derby.impl.store.replication.slave.SlaveController$SlaveLogReceiverThread.run(Unknown
Source)
> Caused by: java.net.SocketException: Connection reset
> 	at java.net.SocketInputStream.read(Unknown Source)
> 	at java.net.SocketInputStream.read(Unknown Source)
> 	at java.io.ObjectInputStream$PeekInputStream.peek(Unknown Source)
> 	at java.io.ObjectInputStream$BlockDataInputStream.peek(Unknown Source)
> 	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
> 	at java.io.ObjectInputStream.readObject0(Unknown Source)
> 	at java.io.ObjectInputStream.readObject(Unknown Source)
> 	at org.apache.derby.impl.store.replication.net.SocketConnection.readMessage(Unknown
Source)
> 	at org.apache.derby.impl.store.replication.net.ReplicationMessageReceive.readMessage(Unknown
Source)
> [snipped further stack traces]
> --------------------  END REPLICATION ERROR MESSAGE ---------------------
> Replication slave role was stopped for database 'replicDB'.
> Replication slave role was stopped for database 'replicDB'.
> 2008-04-22 12:01:42.921 GMT:
> Shutting down instance 601a400f-0119-7600-eb23-000000383460
> ----------------------------------------------------------------
> Full derby.log from the slave is attached.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message