lucene-dev mailing list archives

From "Suril Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-13532) Unable to start core recovery due to timeout in ping request
Date Thu, 20 Jun 2019 17:56:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16868807#comment-16868807 ]

Suril Shah commented on SOLR-13532:
-----------------------------------

[~caomanhdat]: I will send in a patch today.

> Unable to start core recovery due to timeout in ping request
> ------------------------------------------------------------
>
>                 Key: SOLR-13532
>                 URL: https://issues.apache.org/jira/browse/SOLR-13532
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 7.6
>            Reporter: Suril Shah
>            Priority: Major
>
> Discovered the following issue with core recovery:
>  * Core recovery is never initiated, and the following message is logged:
> {code:java}
> 2019-06-07 00:53:12.436 INFO  (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again{code}
>  * This error occurs when the ping request to the leader takes longer than the timeout, which is hard-coded to one second (1000 ms) in the Solr source code. In a typical production setting, ping times above one second are common, so the ping fails and core recovery never starts.
>  * The other major concern is that this failure is logged as an info message, so it is very difficult to identify if info logging is not enabled.
>  * Please refer to the following code snippet from the [source code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] to understand the issue.
> {code:java}
>       try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>           .withSocketTimeout(1000)
>           .withConnectionTimeout(1000)
>           .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>           .build()) {
>         SolrPingResponse resp = httpSolrClient.ping();
>         return leaderReplica;
>       } catch (IOException e) {
>         log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>         Thread.sleep(500);
>       } catch (Exception e) {
>         if (e.getCause() instanceof IOException) {
>           log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>           Thread.sleep(500);
>         } else {
>           return leaderReplica;
>         }
>       }
> {code}
> The above issue has a high impact on production clusters, since cores that cannot recover may lead to data loss.
> The following improvements would be really helpful:
>  1. The [timeout for ping request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791] in *RecoveryStrategy.java* should be configurable, with the default set to a higher value such as 15 seconds.
>  2. The messages at [line 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797] and [line 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801] in *RecoveryStrategy.java* should be logged as *error* messages instead of *info* messages.
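
As an illustration of the first improvement, here is a minimal sketch of how a configurable ping timeout could be resolved. It is not the patch for this issue. The class name `RecoveryPingSettings` and the system property `solr.recovery.pingTimeoutMs` are hypothetical and do not exist in Solr; the 15 second default is the value suggested in the issue.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of Solr: resolves the ping timeout used during recovery.
final class RecoveryPingSettings {
    // Hypothetical property name, chosen for this sketch only.
    static final String TIMEOUT_PROPERTY = "solr.recovery.pingTimeoutMs";
    // Default proposed in the issue description: 15 seconds.
    static final int DEFAULT_TIMEOUT_MS = 15_000;

    private static final Logger LOG =
        Logger.getLogger(RecoveryPingSettings.class.getName());

    private RecoveryPingSettings() {}

    /** Parses a raw value; falls back to the default if it is missing, malformed, or not positive. */
    static int resolveTimeoutMs(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_TIMEOUT_MS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_TIMEOUT_MS;
        } catch (NumberFormatException e) {
            LOG.log(Level.WARNING, "Ignoring invalid {0} value: {1}",
                new Object[] {TIMEOUT_PROPERTY, raw});
            return DEFAULT_TIMEOUT_MS;
        }
    }

    /** Reads the timeout from the (hypothetical) system property. */
    static int pingTimeoutMs() {
        return resolveTimeoutMs(System.getProperty(TIMEOUT_PROPERTY));
    }

    public static void main(String[] args) {
        System.out.println("ping timeout ms = " + pingTimeoutMs());
    }
}
```

Under these assumptions, the resolved value could be passed to the `withSocketTimeout(...)` and `withConnectionTimeout(...)` calls shown in the quoted snippet in place of the literal `1000`.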



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

