lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Petetin Ludovic <ludovic.pete...@kelkoo.com>
Subject Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery at the same time ?
Date Fri, 07 Oct 2016 08:18:43 GMT
Hi Erick,

thanks for taking time to answer.

What we are looking for here is not the root cause but a mechanism to prevent the leader from
doing this silly thing of putting all the replicas in recovery at the same time which inevitably
leads to a downtime.

Is there a similar conf in Solr as the conf below in Elastic Search which guarantees that
we have always enough nodes up to serve the traffic ?

gateway.recover_after_nodes
Recover as long as this many data or master nodes have joined the cluster.
To answer your questions as well, when the problem occured, we had :
- no traffic peak
- no JVM memory issue
- no system memory issue
- no network issue
- no disk space issue
- no Zookeepers issue

The impacted cluster is composed of 6 DELL R420 servers with 96GB of RAM, with an index of
10M of documents serving our traffic (400 queries per second) in France.
We have several other Solr clusters in the same datacenter sharing the same Zookeepers but
serving other European and American countries which were not impacted when the issue occured.

I paste below the logs of the leader at the moment of the issue and the logs of the replica
which is put in recovery in this paste (solr-57).

1. Log of the leader
2016-10-04 18:18:46,326 [searcherExecutor-87-thread-1-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.core.QuerySenderListener:newSearcher:49  - QuerySenderListener sending
requests to Searcher@43a597cb[fr_blue] main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_gxjd(5.4.1):C3804213/811353:delGen=2)
Uninverting(_gwmj(5.4.1):C3449095/1848374:delGen=8) Uninverting(_gxkn(5.4.1):C2609089/459163:delGen=3)
Uninverting(_gxs7(5.4.1):c669028/10913:delGen=1) Uninverting(_gxxf(5.4.1):c454860/104356:delGen=1)
Uninverting(_gy55(5.4.1):c373416/59027) Uninverting(_gybq(5.4.1):c222631/1631) Uninverting(_gy6c(5.4.1):c29736/591:delGen=1)
Uninverting(_gy76(5.4.1):C2020/45:delGen=2) Uninverting(_gy74(5.4.1):C1909/40:delGen=2) Uninverting(_gyhk(5.4.1):c192078/656)
Uninverting(_gye4(5.4.1):c66133/977) Uninverting(_gyio(5.4.1):c43028/274) Uninverting(_gyjk(5.4.1):c48841/276)
Uninverting(_gykl(5.4.1):c42989/101) Uninverting(_gylf(5.4.1):c31401/50) Uninverting(_gylt(5.4.1):c5802/6)
Uninverting(_gyln(5.4.1):c4119/1) Uninverting(_gyn1(5.4.1):c36015/68) Uninverting(_gyo6(5.4.1):c27906/18)
Uninverting(_gyox(5.4.1):c23494/4) Uninverting(_gyp3(5.4.1):c3686) Uninverting(_gyov(5.4.1):C1124)
Uninverting(_gyot(5.4.1):C1224/1:delGen=1) Uninverting(_gyp0(5.4.1):C514/1:delGen=1) Uninverting(_gyoy(5.4.1):C573/1:delGen=1)
Uninverting(_gyp1(5.4.1):C120/1:delGen=1) Uninverting(_gyou(5.4.1):C747/1:delGen=1) Uninverting(_gyor(5.4.1):C467)
Uninverting(_gyp2(5.4.1):C101) Uninverting(_gyoz(5.4.1):C250)))}
2016-10-04 18:18:46,326 [searcherExecutor-87-thread-1-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.core.QuerySenderListener:newSearcher:96  - QuerySenderListener done.
2016-10-04 18:25:55,647 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:onCommit:99
 - SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7ma,generation=9874}
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7mb,generation=9875}
2016-10-04 18:25:55,650 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:updateCommits:166
 - newest commit generation = 9875
2016-10-04 18:25:56,010 [commitScheduler-89-thread-1] INFO  org.apache.solr.search.SolrIndexSearcher:<init>:237
 - Opening Searcher@39da46db[fr_blue] realtime
2016-10-04 18:40:56,195 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:onCommit:99
 - SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7mb,generation=9875}
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20160731234345634,segFN=segments_7mc,generation=9876}
2016-10-04 18:40:56,196 [commitScheduler-89-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:updateCommits:166
 - newest commit generation = 9876
2016-10-04 18:40:56,387 [commitScheduler-89-thread-1] INFO  org.apache.solr.search.SolrIndexSearcher:<init>:237
 - Opening Searcher@7ce3335c[fr_blue] realtime
2016-10-04 18:43:39,426 [updateExecutor-2-thread-105261-processing-http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue
s:shard1 x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr
r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue] ERROR org.apache.solr.update.StreamingSolrClients:handleError:78
 - error
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:279)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:160)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-10-04 18:43:39,430 [http-8080-86] WARN  org.apache.solr.update.processor.DistributedUpdateProcessor:doFinish:792
 - Error sending update to http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:279)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:160)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-10-04 18:43:39,430 [http-8080-86] ERROR org.apache.solr.update.processor.DistributedUpdateProcessor:doFinish:886
 - Setting up to try to start recovery on replica http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:196)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
        at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
        at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
        at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.sendUpdateStream(ConcurrentUpdateSolrClient.java:279)
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient$Runner.run(ConcurrentUpdateSolrClient.java:160)
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:232)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2016-10-04 18:43:39,431 [http-8080-86] INFO  org.apache.solr.cloud.ZkController:ensureReplicaInLeaderInitiatedRecovery:2054
 - Put replica core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
on dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr into leader-initiated recovery.
2016-10-04 18:43:39,432 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.common.cloud.SolrZkClient:makePath:472  - makePath: /collections/fr_blue/leader_initiated_recovery/shard1
2016-10-04 18:43:39,441 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.ZkController:updateLeaderInitiatedRecoveryState:2194  - Wrote
down to /collections/fr_blue/leader_initiated_recovery/shard1/dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
2016-10-04 18:43:39,442 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:publishDownState:123  - Put replica
core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
on dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr into leader-initiated recovery.
2016-10-04 18:43:39,442 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
WARN  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:publishDownState:158  - Leader is
publishing core=fr_blue coreNodeName =dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
state=down on behalf of un-reachable replica http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/
2016-10-04 18:43:39,448 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:sendRecoveryCommandWithRetry:190
 - LeaderInitiatedRecoveryThread-fr_blue started running to send REQUESTRECOVERY command to
http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/; will try for
a max of 600 secs
2016-10-04 18:43:39,449 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:sendRecoveryCommandWithRetry:202
 - Asking core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
on http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr to recover
2016-10-04 18:43:39,450 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954]
INFO  org.apache.solr.cloud.Overseer:run:213  - processMessage: queueSize: 1, message = {
  "operation":"state",
  "state":"down",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>,
  "core":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "shard":"shard1",
  "collection":"fr_blue"} current state version: 30797
2016-10-04 18:43:39,453 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954]
INFO  org.apache.solr.cloud.overseer.ReplicaMutator:updateState:211  - Update state numShards=null
message={
  "operation":"state",
  "state":"down",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>,
  "core":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "shard":"shard1",
  "collection":"fr_blue"}
2016-10-04 18:43:39,453 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954]
INFO  org.apache.solr.cloud.overseer.ReplicaMutator:updateState:239  - node=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
is already registered
2016-10-04 18:43:39,457 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:sendRecoveryCommandWithRetry:211
 - Successfully sent REQUESTRECOVERY command to core=fr_blue coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
on http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr
2016-10-04 18:43:39,460 [updateExecutor-2-thread-105288-processing-s:shard1 http:////dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080//searchsolrnodefr//fr_blue//
x:fr_blue c:fr_blue n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.LeaderInitiatedRecoveryThread:run:111  - LeaderInitiatedRecoveryThread-fr_blue
completed successfully after running for 29.0ms
2016-10-04 18:43:39,466 [zkCallback-4-thread-118-processing-n:dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr]
INFO  org.apache.solr.common.cloud.ZkStateReader:process:915  - A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 6)
2016-10-04 18:43:39,516 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954]
INFO  org.apache.solr.cloud.Overseer:run:213  - processMessage: queueSize: 1, message = {
  "operation":"state",
  "core_node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue",
  "numShards":"1",
  "shard":"shard1",
  "roles":null,
  "state":"recovering",
  "core":"fr_blue",
  "collection":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>}
current state version: 30798
2016-10-04 18:43:39,518 [OverseerStateUpdate-1970092343297114116-dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080_searchsolrnodefr-n_0000001954]
INFO  org.apache.solr.cloud.overseer.ReplicaMutator:updateState:211  - Update state numShards=1
message={
  "operation":"state",
  "core_node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue",
  "numShards":"1",
  "shard":"shard1",
  "roles":null,
  "state":"recovering",
  "core":"fr_blue",
  "collection":"fr_blue",
  "node_name":"dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr",
  "base_url":"http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr"<http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr>}


2. Log the replica (solr-57) which is put in recovery
2016-10-04 18:40:57,014 [commitScheduler-95-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:onCommit:99
 - SolrDeletionPolicy.onCommit: commits: num=2
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20161002014159446,segFN=segments_7mb,generation=9875}
        commit{dir=/opt/kookel/data/searchSolrNode/solrindex/fr1_blue/index.20161002014159446,segFN=segments_7mc,generation=9876}
2016-10-04 18:40:57,014 [commitScheduler-95-thread-1] INFO  org.apache.solr.core.SolrDeletionPolicy:updateCommits:166
 - newest commit generation = 9876
2016-10-04 18:40:57,293 [commitScheduler-95-thread-1] INFO  org.apache.solr.search.SolrIndexSearcher:<init>:237
 - Opening Searcher@6d09e525[fr_blue] realtime
2016-10-04 18:43:39,453 [http-8080-92] INFO  org.apache.solr.handler.admin.CoreAdminHandler:handleRequestRecoveryAction:812
 - It has been requested that we recover: core=fr_blue
2016-10-04 18:43:39,453 [Thread-30289] INFO  org.apache.solr.cloud.ActionThrottle:minimumWaitBetweenActions:72
 - The last recovery attempt started 234111418ms ago.
2016-10-04 18:43:39,456 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.RecoveryStrategy:run:223  - Starting recovery process. recoveringAfterStartup=false
2016-10-04 18:43:39,456 [http-8080-92] INFO  org.apache.solr.servlet.HttpSolrCall:handleAdminRequest:667
 - [admin] webapp=null path=/admin/cores params={action=REQUESTRECOVERY&core=fr_blue&wt=javabin&version=2}
status=0 QTime=3
2016-10-04 18:43:39,457 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.RecoveryStrategy:doRecovery:342  - Begin buffering updates. core=fr_blue
2016-10-04 18:43:39,466 [zkCallback-4-thread-84-processing-n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr]
INFO  org.apache.solr.common.cloud.ZkStateReader:process:915  - A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 6)
2016-10-04 18:43:39,505 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.RecoveryStrategy:doRecovery:346  - Publishing state of core fr_blue
as recovering, leader is http://dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/
and I am http://dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080/searchsolrnodefr/fr_blue/
2016-10-04 18:43:39,505 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.ZkController:publish:1199  - publishing state=recovering
2016-10-04 18:43:39,505 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.ZkController:publish:1203  - numShards not found on descriptor
- reading it from system property
2016-10-04 18:43:39,508 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.ZkController:updateLeaderInitiatedRecoveryState:2194  - Wrote
recovering to /collections/fr_blue/leader_initiated_recovery/shard1/dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue
2016-10-04 18:43:39,513 [updateExecutor-2-thread-10-processing-s:shard1 x:fr_blue c:fr_blue
n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr r:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue]
INFO  org.apache.solr.cloud.RecoveryStrategy:sendPrepRecoveryCmd:605  - Sending prep recovery
command to http://dc1-s6-prod-solr-52.prod.dc1.kelkoo.net:8080/searchsolrnodefr; WaitForState:
action=PREPRECOVERY&core=fr_blue&nodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr&coreNodeName=dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr_fr_blue&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true
2016-10-04 18:43:39,627 [zkCallback-4-thread-84-processing-n:dc1-s6-prod-solr-57.prod.dc1.kelkoo.net:8080_searchsolrnodefr]
INFO  org.apache.solr.common.cloud.ZkStateReader:process:915  - A cluster state change: WatchedEvent
state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating...
(live nodes size: 6)

Thanks,

Ludo



-------- Forwarded Message --------
Subject:        Re: [Solr-5-4-1] Why SolrCloud leader is putting all replicas in recovery
at the same time ?
Date:   Thu, 6 Oct 2016 16:00:01 +0200
From:   Erick Erickson <erickerickson@gmail.com><mailto:erickerickson@gmail.com>
Reply-To:       solr-user@lucene.apache.org<mailto:solr-user@lucene.apache.org> <solr-user@lucene.apache.org><mailto:solr-user@lucene.apache.org>
To:     solr-user <solr-user@lucene.apache.org><mailto:solr-user@lucene.apache.org>


There is no information here at all that would us to say anything meaningful. You might review:
http://wiki.apache.org/solr/UsingMailingLists

What do the logs say? Are there any exceptions? What happens on your system that's
unusual if anything? In short, what have you tried to do to diagnose the cause and what
have you learned?

But here's a random guess. You didn't configure your log4j properties and your logs, particularly
your console log have grown to huge sizes and occasionally you encounter disk full issues.
Or
your ZK nodes have the same problem, they accumulate snapshots (see the Zookeeper admin page).

Best,
Erick

On Thu, Oct 6, 2016 at 2:23 AM, Gerald Reinhart <gerald.reinhart@kelkoo.com<mailto:gerald.reinhart@kelkoo.com>>
wrote:

Hello everyone,

    Our Solr Cloud  works very well for several months without any significant changes: the
traffic to serve is stable, no major release deployed...

    But randomly, the Solr Cloud leader puts all the replicas in recovery at the same time
for no obvious reason.

    Hence, we can not serve the queries any more and the leader is overloaded while replicating
all the indexes on the replicas at the same time which eventually implies a downtime of approximately
30 minutes.

    Is there a way to prevent it ? Ideally, a configuration saying a percentage of replicas
to be put in recovery at the same time?

Thanks,

Gérald, Elodie and Ludovic


________________________________
Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 158 Ter Rue du Temple 75003 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive
de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire
et d'en avertir l'expéditeur.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message