lucene-solr-user mailing list archives

From Joe Obernberger <joseph.obernber...@gmail.com>
Subject Re: Solr 6.3.0 - recovery failed
Date Wed, 01 Feb 2017 18:41:11 GMT
Thank you.  I do not see any "Caused by" block in the solr.log.
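
For reference, the quickest way I know to confirm there really is no root
cause is a grep on solr.log directly on the node rather than through the
admin UI; roughly this (the log path assumes the stock install layout, so
adjust it for your setup):

# show any root-cause entry with 20 lines of context
grep -n -A 20 "Caused by" /var/solr/logs/solr.log

# count the recovery retries logged so far
grep -c "Recovery failed - trying again" /var/solr/logs/solr.log

The relevant section of the log follows.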

-----------

2017-02-01 18:37:57.566 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Replay not 
started, or was not successful... still buffering updates.
2017-02-01 18:37:57.566 ERROR 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Recovery 
failed - trying again... (50)
2017-02-01 18:37:57.566 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Wait [12.0] 
seconds before trying to recover again (attempt=51)
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Begin 
buffering updates. core=[Worldline2New_shard22_replica2]
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.u.UpdateLog Restarting 
buffering. previous=RecoveryInfo{adds=0 deletes=0 deleteByQuery=0 
errors=0 positionOfStart=0}
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.u.UpdateLog Starting to buffer 
updates. HDFSUpdateLog{state=BUFFERING, tlog=null}
2017-02-01 18:38:57.567 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Publishing 
state of core [Worldline2New_shard22_replica2] as recovering, leader is 
[http://cordelia:9100/solr/Worldline2New_shard22_replica1/] and I am 
[http://bilbo:9100/solr/Worldline2New_shard22_replica2/]
2017-02-01 18:38:57.586 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Sending prep 
recovery command to [http://cordelia:9100/solr]; [WaitForState: 
action=PREPRECOVERY&core=Worldline2New_shard22_replica1&nodeName=bilbo:9100_solr&coreNodeName=core_node34&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
2017-02-01 18:38:57.644 INFO 
(zkCallback-5-thread-49-processing-n:bilbo:9100_solr) [   ] 
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent 
state:SyncConnected type:NodeDataChanged 
path:/collections/Worldline2New/state.json] for collection 
[Worldline2New] has occurred - updating... (live nodes size: [9])
2017-02-01 18:39:04.594 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Starting 
Replication Recovery.
2017-02-01 18:39:04.594 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Attempting to 
replicate from [http://cordelia:9100/solr/Worldline2New_shard22_replica1/].
2017-02-01 18:39:04.604 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Master's 
generation: 12398
2017-02-01 18:39:04.612 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Master's version: 
1485965089535
2017-02-01 18:39:04.612 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Slave's 
generation: 12358
2017-02-01 18:39:04.612 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.h.IndexFetcher Starting 
replication process
2017-02-01 18:39:04.990 ERROR 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Error while 
trying to recover:org.apache.solr.common.SolrException: Replication for 
recovery failed.
         at 
org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:159)
         at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:408)
         at 
org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:221)
         at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
         at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
         at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
         at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
         at java.lang.Thread.run(Thread.java:745)

2017-02-01 18:39:04.990 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Replay not 
started, or was not successful... still buffering updates.
2017-02-01 18:39:04.990 ERROR 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Recovery 
failed - trying again... (51)
2017-02-01 18:39:04.990 INFO 
(recoveryExecutor-3-thread-8-processing-n:bilbo:9100_solr 
x:Worldline2New_shard22_replica2 s:shard22 c:Worldline2New 
r:core_node34) [c:Worldline2New s:shard22 r:core_node34 
x:Worldline2New_shard22_replica2] o.a.s.c.RecoveryStrategy Wait [12.0] 
seconds before trying to recover again (attempt=52)

-----------------

-Joe


On 2/1/2017 1:37 PM, Alessandro Benedetti wrote:
> I can't debug the code now, but if you access the logs directly (not
> through the UI), is there any "Caused by" associated with the recovery
> failure exception?
> Cheers
>
> On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <joseph.obernberger@gmail.com>
> wrote:
>
>> When a node fails, it leaves write.lock files behind in HDFS.  These
>> files have to be removed manually; otherwise the shards/replicas that
>> still have write.lock files will not start.  Since I can't tell which
>> physical node is hosting which shard/replica, I stop all the nodes,
>> delete all the write.lock files in HDFS, and restart.
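>>
>> The cleanup itself is only a couple of HDFS commands; roughly this,
>> assuming the indexes live under /solr (i.e. wherever solr.hdfs.home
>> points - adjust the path for your setup):
>>
>> # list every stale lock file left behind by the failed node
>> hdfs dfs -ls -R /solr | grep write.lock
>>
>> # remove them in one pass (field 8 of the ls output is the file path);
>> # only safe while all the Solr nodes are stopped
>> hdfs dfs -ls -R /solr | grep write.lock | awk '{print $8}' | xargs -r -n 1 hdfs dfs -rm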
>>
>> You are correct - only one replica is failing to start.  The other
>> replicas on the same physical node are coming up OK.  A picture is worth
>> a thousand words, so:
>> http://lovehorsepower.com/images/Cluster1.jpg
>>
>> Errors:
>> http://lovehorsepower.com/images/ClusterSolr2.jpg
>>
>> -Joe
>>
>> On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
>>
>>> Ok, it is clearer now.
>>> You have 9 Solr nodes running, one per physical machine.
>>> So each node hosts a number of cores (both replicas and leaders).
>>> When the node died, you got a lot of corrupted indexes.
>>> I still don't see why you restarted the other 8 working nodes (I was
>>> expecting you to restart only the failed one).
>>>
>>> When you mention that only one replica is failing, do you mean that the
>>> Solr node is up and running and only one Solr core (the replica of one
>>> shard) keeps failing?
>>> Or are all the local cores on that node failing to recover?
>>> Cheers
>>>
>>> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <joseph.obernberger@gmail.com>
>>> wrote:
>>>
>>> Thank you for the response.
>>> There are no virtual machines in the configuration.  The collection has
>>> 45 shards with 3 replicas each, spread across the 9 physical boxes; each
>>> box runs one copy of Solr.  I've tried to restart just the one node after
>>> the other 8 (and all their shards/replicas) came up, but this one replica
>>> seems to be in perma-recovery.
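>>>
>>> If it helps, the Collections API can show exactly which node hosts the
>>> stuck replica and what state it is in; something along these lines
>>> against any live node (host and port here are just one of my nodes):
>>>
>>> # dump the cluster state for the collection and look for any replica
>>> # still reported as "recovering"
>>> curl -s "http://bilbo:9100/solr/admin/collections?action=CLUSTERSTATUS&collection=Worldline2New&wt=json&indent=true" | grep -B 6 '"recovering"'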
>>>
>>> Shard Count: 45
>>> replicationFactor: 3
>>> maxShardsPerNode: 50
>>> router: compositeId
>>> autoAddReplicas: false
>>>
>>> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>>>
>>> _TUNE is:
>>> "-XX:+UseG1GC \
>>> -XX:MaxDirectMemorySize=8g \
>>> -XX:+PerfDisableSharedMem \
>>> -XX:+ParallelRefProcEnabled \
>>> -XX:G1HeapRegionSize=32m \
>>> -XX:MaxGCPauseMillis=500 \
>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>> -XX:ParallelGCThreads=16 \
>>> -XX:+UseLargePages \
>>> -XX:-ResizePLAB \
>>> -XX:+AggressiveOpts"
>>>
>>> So far it has retried 22 times.  The cluster is accessible and OK, but
>>> I'm afraid to keep indexing data if this one node will never come back.
>>> Thanks for the help!
>>>
>>> -Joe
>>>
>>>
>>>
>>> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>>>
>>>> Let me try to summarize.
>>>> How many virtual machines are on top of the 9 physical ones?
>>>> How many Solr processes (replicas)?
>>>>
>>>> If you had 1 node compromised,
>>>> I assume you have replicas as well, right?
>>>>
>>>> Can you explain your replica configuration a little better?
>>>> Why did you have to stop all the nodes?
>>>>
>>>> I would expect you to stop the failing Solr node, clean up the index,
>>>> and restart it.
>>>> It would then recover from the leader automatically.
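>>>>
>>>> Something along these lines, on the failing node only (the port is just
>>>> the default as a placeholder, and with HDFS the index directory to clean
>>>> is the remote one, not a local path):
>>>>
>>>> # stop only the Solr instance on the bad node
>>>> bin/solr stop -p 8983
>>>> # remove (or move aside) the index directory of the bad core, then
>>>> # start the node again with your usual cloud options
>>>> bin/solr start -p 8983
>>>> # on startup the core should re-sync from its shard leader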
>>>>
>>>> Something is suspicious here, let us know!
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>> -----
>>>> ---------------
>>>> Alessandro Benedetti
>>>> Search Consultant, R&D Software Engineer, Director
>>>> Sease Ltd. - www.sease.io
>>>> --
>>>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>>

