lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-11469) LeaderElectionContextKeyTest has flawed logic: 50% of the time it checks the wrong shard's elections
Date Wed, 11 Oct 2017 18:04:02 GMT

     [ https://issues.apache.org/jira/browse/SOLR-11469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-11469:
----------------------------
    Attachment: SOLR-11469.patch


Here's my initial attempt at a fix, mainly focusing on...

* adding comments
* adding logging
* renaming variables to be more explicit about what we're expecting
* tightening up the call to {{findLeaderReplicaWithDuplicatedName}} so we explicitly look for
the leader of shard1, since that's what we assert against later
* adding extra asserts that shard2 doesn't have an election either (the existing asserts only
checked the second collection)


With this fix, the test *seems* to pass a little more often -- but it's still easy to get
a different type of failure, one that I had also suspected was very plausible given the existing
code...

The entire premise of {{findLeaderReplicaWithDuplicatedName}} is that we can find "a leader"
from collection1 with the same {{Replica.getName()}} as a Replica from collection2 -- but
IIUC there's no guarantee that will be true.

Here's an example failure with the patch applied...

{noformat}
   [junit4]   2> 8485 INFO  (TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874])
[    ] o.a.s.SolrTestCaseJ4 ###Starting test
   [junit4]   2> 8486 INFO  (TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874])
[    ] o.a.s.c.LeaderElectionContextKeyTest All Col1 Replicas: [core_node2:{"core":"testCollection1_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"},
core_node4:{"core":"testCollection1_shard2_replica_n3","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
   [junit4]   2> 8486 INFO  (TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874])
[    ] o.a.s.c.LeaderElectionContextKeyTest All Col2 Replicas: [core_node3:{"core":"testCollection2_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"},
core_node4:{"core":"testCollection2_shard2_replica_n2","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
   [junit4]   2> 8488 INFO  (TEST-LeaderElectionContextKeyTest.test-seed#[B0F9446FF638874])
[    ] o.a.s.SolrTestCaseJ4 ###Ending test
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=LeaderElectionContextKeyTest
-Dtests.method=test -Dtests.seed=B0F9446FF638874 -Dtests.slow=true -Dtests.locale=ga -Dtests.timezone=Asia/Chongqing
-Dtests.asserts=true -Dtests.file.encoding=US-ASCII
   [junit4] FAILURE 0.02s | LeaderElectionContextKeyTest.test <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: Unable to find col1+shard1 leader
w/same name as replica in col2: [core_node2:{"core":"testCollection1_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
<=?=> [core_node3:{"core":"testCollection2_shard1_replica_n1","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"},
core_node4:{"core":"testCollection2_shard2_replica_n2","base_url":"http://127.0.0.1:56971/solr","node_name":"127.0.0.1:56971_solr","state":"active","type":"NRT","leader":"true"}]
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([B0F9446FF638874:835BAB9C519FE58C]:0)
   [junit4]    > 	at org.apache.solr.cloud.LeaderElectionContextKeyTest.test(LeaderElectionContextKeyTest.java:95)
   [junit4]    > 	at java.lang.Thread.run(Thread.java:748)
{noformat}

Note:
* that seed won't reproduce reliably, because the leader node _might_ randomly have the same
name as one of the replicas from the other collection
* in the particular log above, if we did our testing/assertions against col1+shard2 instead
of col1+shard1 then we'd get lucky and find the coreNodeName overlap with col2 that the test
expects -- but unless I'm missing something that's still just a fluke and not something we
can depend upon
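
To make that fluke concrete, here's a minimal plain-Java sketch of the kind of lookup the test depends on, fed with the replica names from the log above. It does not use the Solr API and is not the code from the patch: {{ReplicaInfo}} and {{findShardLeaderWithDuplicatedName}} are hypothetical stand-ins I made up for illustration.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java sketch only; nothing here is Solr's real Replica/DocCollection API.
class DuplicatedNameSketch {

    // Hypothetical minimal view of a replica: coreNodeName, shard, leader flag.
    static final class ReplicaInfo {
        final String name;
        final String shard;
        final boolean leader;

        ReplicaInfo(String name, String shard, boolean leader) {
            this.name = name;
            this.shard = shard;
            this.leader = leader;
        }

        @Override
        public String toString() {
            return name + "(" + shard + (leader ? ",leader" : "") + ")";
        }
    }

    // Find the leader of the given shard in col1 whose name also occurs
    // anywhere in col2. Empty if no such name overlap happens to exist.
    static Optional<ReplicaInfo> findShardLeaderWithDuplicatedName(
            List<ReplicaInfo> col1, List<ReplicaInfo> col2, String shard) {
        Set<String> col2Names =
                col2.stream().map(r -> r.name).collect(Collectors.toSet());
        return col1.stream()
                .filter(r -> r.leader && r.shard.equals(shard))
                .filter(r -> col2Names.contains(r.name))
                .findFirst();
    }

    // coreNodeNames and shards copied from the "All Col1/Col2 Replicas" log lines.
    static final List<ReplicaInfo> COL1 = List.of(
            new ReplicaInfo("core_node2", "shard1", true),
            new ReplicaInfo("core_node4", "shard2", true));
    static final List<ReplicaInfo> COL2 = List.of(
            new ReplicaInfo("core_node3", "shard1", true),
            new ReplicaInfo("core_node4", "shard2", true));

    public static void main(String[] args) {
        System.out.println("col1+shard1: "
                + findShardLeaderWithDuplicatedName(COL1, COL2, "shard1"));
        System.out.println("col1+shard2: "
                + findShardLeaderWithDuplicatedName(COL1, COL2, "shard2"));
    }
}
```

With these names the shard1 lookup comes back empty while the shard2 lookup finds {{core_node4}}, and nothing in the sketch (or, IIUC, in the test) forces either outcome.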

I'm not really sure how to make this test work reliably ... unless maybe we manually add
replicas with an explicitly specified {{coreNodeName}} and force them to be the leader?


> LeaderElectionContextKeyTest has flawed logic: 50% of the time it checks the wrong shard's elections
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11469
>                 URL: https://issues.apache.org/jira/browse/SOLR-11469
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>         Attachments: SOLR-11469.patch
>
>
> LeaderElectionContextKeyTest is very flaky -- and on Miller's beastit reports it shows
> a suspiciously close to "50%" failure rate.
> Digging into the test I realized that it creates a 2 shard index, then picks "a leader"
> to kill (arbitrarily) and then asserts that the leader election nodes for *shard1* are affected
> ... so ~50% of the time it kills the shard2 leader and then fails because it doesn't see an
> election in shard1.



