lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-6235) SyncSliceTest fails on jenkins with no live servers available error
Date Thu, 10 Jul 2014 13:42:04 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057467#comment-14057467 ]

Shalin Shekhar Mangar commented on SOLR-6235:
---------------------------------------------

Wow, crazy crazy bug! I finally found the root cause.

The problem is in the leader-initiated recovery code, which uses the core name as the key
to set/get a replica's status. This works fine as long as the core names on all nodes are
different, but if they all happen to be "collection1" then we have this problem :)
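
To make the collision concrete, here is a minimal sketch. It is not Solr's actual code: the class, the in-memory map (standing in for the shared status store) and the method names are all hypothetical. It only shows why a status keyed by core name is shared by every replica named "collection1", and how a per-replica key such as the core node name (an assumed alternative, not necessarily the committed fix) keeps the entries apart.

```java
import java.util.HashMap;
import java.util.Map;

public class LirKeyCollision {
    // Hypothetical stand-in for the shared recovery-status store; not Solr's API.
    static final Map<String, String> statusByKey = new HashMap<>();

    // Record that the replica identified by 'key' must recover.
    static void markDown(String key) {
        statusByKey.put(key, "down");
    }

    // A replica asks whether it has been put into leader-initiated recovery.
    static boolean isMarkedDown(String key) {
        return "down".equals(statusByKey.get(key));
    }

    public static void main(String[] args) {
        // Keyed by core name: every replica in this scenario is "collection1".
        markDown("collection1");                          // meant for the replica on port 51919
        System.out.println(isMarkedDown("collection1"));  // any other replica sees it too

        statusByKey.clear();

        // Keyed by a per-replica name (assumption): only the target is affected.
        markDown("core_node2");
        System.out.println(isMarkedDown("core_node1"));
        System.out.println(isMarkedDown("core_node2"));
    }
}
```

With the core-name key, the first lookup returns true for a replica that was never the target; with the per-replica key, only core_node2 reports itself as marked.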

In this particular failure that I investigated:
http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/1667/consoleText

Here's the sequence of events:
# port:51916 - core_node1 was initially the leader, docs were indexed and then it was killed
# port:51919 - core_node2 became the leader, peer sync happened, shards were checked for consistency
# port:51916 - core_node1 was brought back online, it recovered from the leader, consistency
check passed
# port:51923 core_node3 and port:51932 core_node4 were added to the skipped servers
# 300 docs were indexed (to go beyond the peer sync limit)
# port:51919 - core_node2 (the leader) was killed

Here is where things get interesting:
# port:51923 core_node3 tries to become the leader and initiates sync with other replicas
# In the meantime, a commit request from checkShardConsistency makes its way to port:51923
core_node3 (even though it's not clear whether it has actually become the leader)
# port:51923 core_node3 calls commit on all shards including port:51919 core_node2 which should've
been down but perhaps the local state at 51923 is not updated yet?
# port:51923 core_node3 puts replica collection1 on 127.0.0.1:51919_ into leader-initiated
recovery
# port:51923 - core_node3 fails to peersync (because the number of changes was too large) and
rejoins the election
# After this point each replica that tries to become the leader fails because it thinks that
it has been put under leader-initiated recovery and goes into actual "recovery"
# Of course, since there is no leader, recovery cannot happen and each replica eventually goes
to the "recovery_failed" state
# Eventually the test gives up and throws an error saying that there are no live servers available
to handle the request.
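
The stall in the last few steps can be sketched as a toy election. This is a simplified model, not Solr's election code: the Replica class, electLeader and the keyByCoreName flag are hypothetical, and peer sync is ignored entirely. A candidate that finds its own key in the set of marked entries goes into recovery instead of leading.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ElectionStall {
    // Hypothetical replica descriptor: unique core node name plus (possibly shared) core name.
    static final class Replica {
        final String coreNodeName;
        final String coreName;

        Replica(String coreNodeName, String coreName) {
            this.coreNodeName = coreNodeName;
            this.coreName = coreName;
        }
    }

    // Returns the core node name of the first candidate that does not believe it is
    // under leader-initiated recovery, or null if every candidate drops out.
    static String electLeader(List<Replica> candidates, Set<String> markedKeys, boolean keyByCoreName) {
        for (Replica r : candidates) {
            String key = keyByCoreName ? r.coreName : r.coreNodeName;
            if (markedKeys.contains(key)) {
                continue; // goes into "recovery", which cannot finish without a leader
            }
            return r.coreNodeName;
        }
        return null;
    }

    public static void main(String[] args) {
        List<Replica> live = Arrays.asList(
                new Replica("core_node1", "collection1"),
                new Replica("core_node3", "collection1"),
                new Replica("core_node4", "collection1"));

        // Marker written for the dead leader, keyed by core name: nobody can lead.
        System.out.println(electLeader(live, Collections.singleton("collection1"), true));

        // Same marker keyed per replica (assumption): a live replica can take over.
        System.out.println(electLeader(live, Collections.singleton("core_node2"), false));
    }
}
```

In the first call every live candidate matches the marker meant for the dead leader, so the result is null, which corresponds to the "no live servers" outcome the test reports. In the second call the marker matches none of the live candidates and core_node1 is returned.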

> SyncSliceTest fails on jenkins with no live servers available error
> -------------------------------------------------------------------
>
>                 Key: SOLR-6235
>                 URL: https://issues.apache.org/jira/browse/SOLR-6235
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, Tests
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 4.10
>
>
> {code}
> 1 tests failed.
> FAILED:  org.apache.solr.cloud.SyncSliceTest.testDistribSearch
> Error Message:
> No live SolrServers available to handle this request
> Stack Trace:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
>         at __randomizedtesting.SeedInfo.seed([685C57B3F25C854B:E9BAD9AB8503E577]:0)
>         at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:317)
>         at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:659)
>         at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
>         at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1149)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1118)
>         at org.apache.solr.cloud.SyncSliceTest.doTest(SyncSliceTest.java:236)
>         at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

