db-derby-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dag H. Wanvik (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (DERBY-4175) Instability in some replication tests under load, since tests don't wait long enough for final state or anticipate intermediate states
Date Fri, 24 Apr 2009 16:11:31 GMT

    [ https://issues.apache.org/jira/browse/DERBY-4175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702427#action_12702427
] 

Dag H. Wanvik edited comment on DERBY-4175 at 4/24/09 9:10 AM:
---------------------------------------------------------------

Thanks for looking at this Knut; I agree, I had done the same
simplification in another version on the waiting loop logic (new in
spin #3 of this patch) when I saw your comment :)

Uploading version # of this patch, which adds another instance of a a
missing wait; which made the test
ReplicationRun_Local_StateTest_part1_2 fail if the machine was loaded.

The patch also increases wait from 0.1s to 1 seconds in two instances,
which did have waiting loops already, but waitined no longer than 2
seconds in all; which was too little under load on my box.

I generally make the waiting loops wait do 20 attempts, waiting 1 second
between each attempt now. That should be sufficient even for a
reasonably loaded machine :)

I also replaced the commented out "System.err.println"s with the debug
util call that's used in these tests to keep things uniform, although
such calls should probably in time be replaced by the normal
BaseTestCase.println (outside scope here).

      was (Author: dagw):
    Thanks for looking at this Knut; I agree, I had done the same
simplification in another version on the waiting loop logic (new in
spin # of this patch) when I saw your comment :)

Uploading version # of this patch, which adds another instance of a a
missing wait; which made the test
ReplicationRun_Local_StateTest_part1_2 fail if the machine was loaded.

The patch also increases wait from 0.1s to 1 seconds in two instances,
which did have waiting loops already, but waitined no longer than 2
seconds in all; which was too little under load on my box.

I generally make the waiting loops wait do 20 attempts, waiting 1 second
between each attempt now. That should be sufficient even for a
reasonably loaded machine :)

I also replaced the commented out "System.err.println"s with the debug
util call that's used in these tests to keep things uniform, although
such calls should probably in time be replaced by the normal
BaseTestCase.println (outside scope here).
  
> Instability in some replication tests under load, since tests don't wait long enough
for final state or anticipate intermediate states
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DERBY-4175
>                 URL: https://issues.apache.org/jira/browse/DERBY-4175
>             Project: Derby
>          Issue Type: Bug
>          Components: Regression Test Failure, Replication
>         Environment: Solaris 2008.11. snv_111 (x86) on trunk.
>            Reporter: Dag H. Wanvik
>            Assignee: Dag H. Wanvik
>            Priority: Minor
>         Attachments: derby-4175-2.diff, derby-4175-3.diff, derby-4175-3.stat, derby-4175.diff,
derby-4175.stat
>
>
> The test expects REPLICATION_DB_NOT_BOOTED (XRE11), but sees
> REPLICATION_SLAVE_SHUTDOWN_OK (XRE 42):
> 1) testReplication_Local_StateTest_part1_1(org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1)junit.framework.AssertionFailedError:
jdbc:derby://localhost:4527//export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat;stopSlave=true
failed: -1 XRE42 DERBY SQL error: SQLCODE: -1, SQLSTATE: XRE42, SQLERRMC: /export/home/dag/java/sb/tests/derby-3417a-replicationTests.ReplicationSuite-sb4.jars.sane-1.6.0_13-21079/db_slave/wombat^TXRE42
> 	at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1._testPostStartedMasterAndSlave_StopSlave(ReplicationRun_Local_StateTest_part1_1.java:226)
> 	at org.apache.derbyTesting.functionTests.tests.replicationTests.ReplicationRun_Local_StateTest_part1_1.testReplication_Local_StateTest_part1_1(ReplicationRun_Local_StateTest_part1_1.java:130)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at org.apache.derbyTesting.junit.BaseTestCase.runBare(BaseTestCase.java:105)
> 	at junit.extensions.TestDecorator.basicRun(TestDecorator.java:24)
> 	at junit.extensions.TestSetup$1.protect(TestSetup.java:21)
> 	at junit.extensions.TestSetup.run(TestSetup.java:25)
> I think this is a race condition: when the slave receives a message to
> shut down (this is what happens here when the server is master's
> server is shut down) it takes some time for this to happen, and in the
> meantime a stopSlave on the slave will get
> REPLICATION_SLAVE_SHUTDOWN_OK.
> In the code, there is a sleep just ahead of the failing stopSlave to
> avoid this scenario:
>         // Take down master - slave connection:
>         killMaster(masterServerHost, masterServerPort);
>         Thread.sleep(5000L); // TEMPORARY to see if slave sees that master is gone!
> and I guess on my laptop, the 5 seconds was not enough. I think it
> would be better to accept both states here as acceptable, than make
> the test brittle. If this is a bug - that we sometimes see
> REPLICATION_SLAVE_SHUTDOWN_OK - and it may well be, since ahead of the
> master stop, we would see SLAVE_OPERATION_DENIED_WHILE_CONNECTED
> (XRE41), I think - then this should be logged as a separate issue.
> In contrast, I think that if connection to the master is *lost*, a
> stopSlave on slave would see REPLICATION_SLAVE_SHUTDOWN_OK as the
> normal response.
> (edit 2009.04.24 by dagw): It turns out that a similar problem is present in
> ReplicationRun_Local_StateTest_part1_2, but there the symptom is that
> a connect succeeds, but is expected to fail. It does should fail
> eventually if given enough time . The error expected is 08004.C.7 in
> this case (CANNOT_CONNECT_TO_DB_IN_SLAVE_MODE).'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message