lucene-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <>
Subject [jira] [Commented] (SOLR-6235) SyncSliceTest fails on jenkins with no live servers available error
Date Thu, 10 Jul 2014 13:42:04 GMT


Shalin Shekhar Mangar commented on SOLR-6235:

Wow, crazy, crazy bug! I finally found the root cause.

The problem is with the leader-initiated recovery code, which uses the core name to set/get
recovery status. This works fine as long as the core names on all nodes are different, but if
they all happen to be "collection1" then we have this problem :)
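To make the root cause concrete, here is a minimal sketch of the keying problem, using plain in-memory maps as a stand-in for the ZooKeeper-backed state (the class and method names here are hypothetical, not the actual Solr code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the root cause: leader-initiated recovery (LIR)
// state keyed by the core name instead of the unique coreNodeName.
public class LirKeyCollision {
    // Stand-ins for the per-shard LIR state that Solr keeps in ZooKeeper.
    static Map<String, String> lirByCoreName = new HashMap<>();
    static Map<String, String> lirByCoreNodeName = new HashMap<>();

    // The leader marks ONE replica (e.g. core_node2) as needing recovery.
    static void markDown(String coreName, String coreNodeName) {
        lirByCoreName.put(coreName, "down");         // buggy keying
        lirByCoreNodeName.put(coreNodeName, "down"); // correct keying
    }

    // Does this replica believe it has been put into LIR?
    static boolean buggyIsDown(String coreName) {
        return "down".equals(lirByCoreName.get(coreName));
    }

    static boolean fixedIsDown(String coreNodeName) {
        return "down".equals(lirByCoreNodeName.get(coreNodeName));
    }

    public static void main(String[] args) {
        // Every node hosts a core named "collection1"; only the
        // coreNodeName (core_node1, core_node2, ...) is unique.
        markDown("collection1", "core_node2");

        // core_node3 checks whether IT is under LIR before contesting
        // the leader election:
        System.out.println(buggyIsDown("collection1")); // true  - wrongly blocked
        System.out.println(fixedIsDown("core_node3"));  // false - correct answer
    }
}
```

With the buggy keying, the "down" flag set for one replica is visible to every replica that shares the core name, which is exactly the collision described above.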

In this particular failure that I investigated, here's the sequence of events:
# port:51916 - core_node1 was initially the leader; docs were indexed and then it was killed
# port:51919 - core_node2 became the leader, peer sync happened, and the shards were checked for consistency
# port:51916 - core_node1 was brought back online; it recovered from the leader and the consistency check passed
# port:51923 core_node3 and port:51932 core_node4 were added to the skipped servers
# 300 docs were indexed (to go beyond the peer sync limit)
# port:51919 - core_node2 (the leader) was killed
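The 300-doc figure matters because peer sync can only repair a replica that is still within the update log's recent-history window (100 updates by default). A rough sketch of that decision, with hypothetical names and version numbers standing in for update-log versions:

```java
// Hypothetical sketch: why indexing 300 docs pushes the replicas past
// the peer-sync window, so a returning node cannot catch up via peer
// sync and must fall back to full recovery.
public class PeerSyncWindow {
    static final int NUM_RECENT_UPDATES = 100; // update-log window (Solr default)

    // Peer sync only works if the replica missed no more updates than
    // the leader still holds in its update log.
    static boolean canPeerSync(long leaderUpdates, long replicaUpdates) {
        return leaderUpdates - replicaUpdates <= NUM_RECENT_UPDATES;
    }

    public static void main(String[] args) {
        System.out.println(canPeerSync(300, 0)); // false -> full recovery needed
        System.out.println(canPeerSync(50, 0));  // true  -> peer sync succeeds
    }
}
```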

Here is where things get interesting:
# port:51923 core_node3 tries to become the leader and initiates sync with the other replicas
# In the meanwhile, a commit request from checkShardConsistency makes its way to port:51923 core_node3 (even though it's not clear whether it has indeed become the leader)
# port:51923 core_node3 calls commit on all shards, including port:51919 core_node2 which should've been down; perhaps the local cluster state at 51923 is not updated yet?
# port:51923 core_node3 puts the replica named collection1 into leader-initiated recovery
# port:51923 core_node3 fails to peer sync (because the number of changes was too large) and rejoins the election
# After this point, each shard that tries to become the leader fails because it thinks that it has been put under leader-initiated recovery, and goes into actual "recovery"
# Of course, since there is no leader, recovery cannot happen, and each shard eventually goes to the "recovery_failed" state
# Eventually the test gives up and throws an error saying that there are no live servers available to handle the request.
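The resulting livelock from the steps above can be sketched as a small simulation (again with hypothetical names; the real logic is spread across Solr's election and recovery code):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation of the livelock: with LIR state keyed by the
// shared core name "collection1", every leader candidate believes it has
// been put into leader-initiated recovery and backs off.
public class ElectionLivelock {
    // LIR flag set once, intended for only one replica, but keyed by
    // the core name that ALL replicas share.
    static Set<String> lirDownCores = new HashSet<>(Set.of("collection1"));

    // Runs the election and returns each candidate's final state.
    static Map<String, String> runElection(List<String> candidates) {
        Map<String, String> state = new LinkedHashMap<>();
        String leader = null;
        for (String candidate : candidates) {
            if (lirDownCores.contains("collection1")) { // buggy shared key
                // The candidate thinks it must recover first; recovery
                // needs a live leader to replicate from, and there is none.
                state.put(candidate, leader == null ? "recovery_failed" : "recovering");
            } else {
                leader = candidate;
                state.put(candidate, "leader");
            }
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(runElection(
                List.of("core_node1", "core_node3", "core_node4")));
        // No candidate ever becomes leader; every replica ends in recovery_failed.
    }
}
```

Because no replica can ever clear the shared flag, the cluster ends with no leader at all, which is what the test ultimately reports as "no live servers available".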

> SyncSliceTest fails on jenkins with no live servers available error
> -------------------------------------------------------------------
>                 Key: SOLR-6235
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, Tests
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 4.10
> {code}
> 1 tests failed.
> Error Message:
> No live SolrServers available to handle this request
> Stack Trace:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
>         at __randomizedtesting.SeedInfo.seed([685C57B3F25C854B:E9BAD9AB8503E577]:0)
>         at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(
>         at org.apache.solr.client.solrj.impl.CloudSolrServer.request(
>         at org.apache.solr.client.solrj.request.QueryRequest.process(
>         at org.apache.solr.client.solrj.SolrServer.query(
>         at
>         at
>         at
>         at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(
> {code}

This message was sent by Atlassian JIRA

