lucene-dev mailing list archives

From "Suril Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-13532) Unable to start core recovery due to timeout in ping request
Date Thu, 13 Jun 2019 07:40:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16862794#comment-16862794 ]

Suril Shah commented on SOLR-13532:
-----------------------------------

[~gus_heck]: We can at least increase the timeout values to 15000 ms as a temporary fix for
this particular issue. 1000 ms is far too low: recovery never starts if a ping response from
the leader is not received within 1000 ms, which can happen for many reasons, such as network
issues or slow commits.
Let me know your thoughts on this.
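
To illustrate the point, here is a minimal, self-contained sketch of how a fixed ping budget interacts with a slow leader. It does not use any Solr classes: `PingTimeoutDemo`, `simulatedPing` and `pingWithRetries` are hypothetical names made up for this example, the leader is simulated with a sleep, and the numbers are scaled down by a factor of ten (100 ms stands in for 1000 ms, 1500 ms for 15000 ms). The retry count is bounded here only so that the demo terminates.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PingTimeoutDemo {

    // Hypothetical stand-in for the leader ping: answers after latencyMs.
    static Callable<Boolean> simulatedPing(long latencyMs) {
        return () -> {
            Thread.sleep(latencyMs);
            return true;
        };
    }

    // Tries the ping up to maxAttempts times. Returns true as soon as one
    // attempt completes within timeoutMs, false if every attempt fails.
    static boolean pingWithRetries(long latencyMs, long timeoutMs, int maxAttempts) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Boolean> pending = pool.submit(simulatedPing(latencyMs));
                try {
                    return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Abandon the slow ping, much like a socket timeout would.
                    pending.cancel(true);
                } catch (ExecutionException e) {
                    // Treat a failed ping like a timeout and retry.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Leader needs 300 ms; a 100 ms budget times out on every attempt.
        System.out.println("short budget reached leader: " + pingWithRetries(300, 100, 3));
        // The same leader is reached on the first attempt with a 1500 ms budget.
        System.out.println("long budget reached leader: " + pingWithRetries(300, 1500, 3));
    }
}
```

As long as the leader's response time stays above the budget, retrying does not help, which is why raising the value is a reasonable stopgap.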

> Unable to start core recovery due to timeout in ping request
> ------------------------------------------------------------
>
>                 Key: SOLR-13532
>                 URL: https://issues.apache.org/jira/browse/SOLR-13532
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 7.6
>            Reporter: Suril Shah
>            Priority: Major
>
> Discovered the following issue with core recovery:
>  * Core recovery is not being initiated, and the following message is logged:
> {code:java}
> 2019-06-07 00:53:12.436 INFO  (recoveryExecutor-4-thread-1-processing-n:<solr_ip>:8983_solr x:<collection_name>_shard41_replica_n2777 c:<collection_name> s:shard41 r:core_node2778) x:<collection_name>_shard41_replica_n2777 o.a.s.c.RecoveryStrategy Failed to connect leader http://<solr_ip>:8983/solr on recovery, try again
> {code}
>  * The above message appears when the ping request takes longer than the timeout, which
> is hard-coded to one second in the Solr source code. In a typical production setting, however,
> ping times of more than one second are common, so core recovery never starts and the
> failure is logged on each attempt.
>  * The other major concern is that this failure is logged as an info message,
> so it is very difficult to identify the error if info logging is not enabled.
>  * Please refer to the following code snippet from the [source code|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L789-L803] to
> understand the above issue.
> {code:java}
>       try (HttpSolrClient httpSolrClient = new HttpSolrClient.Builder(leaderReplica.getCoreUrl())
>           .withSocketTimeout(1000)
>           .withConnectionTimeout(1000)
>           .withHttpClient(cc.getUpdateShardHandler().getRecoveryOnlyHttpClient())
>           .build()) {
>         SolrPingResponse resp = httpSolrClient.ping();
>         return leaderReplica;
>       } catch (IOException e) {
>         log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>         Thread.sleep(500);
>       } catch (Exception e) {
>         if (e.getCause() instanceof IOException) {
>           log.info("Failed to connect leader {} on recovery, try again", leaderReplica.getBaseUrl());
>           Thread.sleep(500);
>         } else {
>           return leaderReplica;
>         }
>       }
> {code}
> The above issue has a high impact on production clusters, since cores that are
> unable to recover may lead to data loss.
> The following improvements would be really helpful:
>  1. The [timeout for ping request|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L790-L791]
> in *RecoveryStrategy.java* should be configurable, with the defaults set to higher values such as
> 15 seconds.
>  2. The messages at [line 797|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L797]
> and [line 801|https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/RecoveryStrategy.java#L801]
> in *RecoveryStrategy.java* should be logged as *error* messages instead of *info* messages.
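
The two requested improvements could look roughly like the sketch below. This is not the Solr implementation and not a proposed patch. `RecoveryPingSettings` and the property name `solr.recovery.leaderPingTimeoutMs` are hypothetical, invented for this example, and `java.util.logging` is used only to keep the sketch free of dependencies (the quoted snippet logs through a `log` object whose API is not reproduced here). The resolved value is what would be passed to `withSocketTimeout` and `withConnectionTimeout` in place of the literal `1000`.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class RecoveryPingSettings {
    private static final Logger LOG =
            Logger.getLogger(RecoveryPingSettings.class.getName());

    // Hypothetical property name, invented for this sketch; not an existing Solr setting.
    static final String TIMEOUT_PROP = "solr.recovery.leaderPingTimeoutMs";

    // Default suggested in the issue: 15 seconds instead of 1 second.
    static final int DEFAULT_TIMEOUT_MS = 15000;

    // Reads the timeout from a JVM system property and falls back to the
    // default when the property is missing, unparsable, or not positive.
    static int resolveTimeoutMs() {
        Integer configured = Integer.getInteger(TIMEOUT_PROP);
        if (configured == null || configured <= 0) {
            return DEFAULT_TIMEOUT_MS;
        }
        return configured;
    }

    // Logs the ping failure at error (SEVERE) level so that it remains
    // visible when info logging is disabled. Returns the message text.
    static String reportPingFailure(String leaderBaseUrl, Exception cause) {
        String message =
                "Failed to connect leader " + leaderBaseUrl + " on recovery, try again";
        LOG.log(Level.SEVERE, message, cause);
        return message;
    }

    public static void main(String[] args) {
        System.out.println("ping timeout (ms): " + resolveTimeoutMs());
        reportPingFailure("http://leader.example:8983/solr",
                new java.io.IOException("simulated timeout"));
    }
}
```

With this shape, an operator could start the JVM with `-Dsolr.recovery.leaderPingTimeoutMs=30000` (again, a made-up property) to raise the budget without a code change.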



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
