lucene-solr-user mailing list archives

From Joe Obernberger <joseph.obernber...@gmail.com>
Subject Re: Solr 6.3.0 - recovery failed
Date Wed, 01 Feb 2017 19:10:49 GMT
I brought down the whole cluster again and brought up one server at a 
time, waiting for it to go green before launching another. Now all 
replicas are OK, including the one that was in perma-recovery mode 
before.  I do notice a large amount of network activity (basically 
pegging the interface) when a node is brought up.  I suspect this is 
especially pronounced because these nodes are not DataNodes in HDFS, so 
the index data has to come over the network.


-Joe
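For what it's worth, a rolling start like the one described above can be scripted. This is only a sketch: the hostnames, ZooKeeper string, and collection name are made up, and it assumes the `bin/solr healthcheck` tool that ships with Solr reports a top-level "healthy" status for the collection.

```shell
#!/bin/sh
# Rolling start: bring up one Solr node at a time and wait until the
# collection's healthcheck reports healthy before starting the next.
# ZK string, collection name, and hostnames below are illustrative.
ZK="zk1:2181,zk2:2181,zk3:2181/solr"
COLLECTION="mycollection"

for host in solr1 solr2 solr3; do
  ssh "$host" "/opt/solr/bin/solr start -c -z $ZK"
  # Poll until no replica is reported as recovering/down.
  until /opt/solr/bin/solr healthcheck -c "$COLLECTION" -z "$ZK" \
        | grep -q '"status": *"healthy"'; do
    sleep 10
  done
done
```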


On 2/1/2017 1:37 PM, Alessandro Benedetti wrote:
> I can't debug the code now, but if you access the logs directly (not
> from the UI), is there any "Caused by" associated with the recovery
> failure exception?
> Cheers
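For anyone following along from the shell rather than the UI: the root cause usually appears as a "Caused by" line in solr.log. A minimal sketch (the log path below is the default for a service install and may differ on your system):

```shell
# Print each root-cause line from the Solr log with a few lines of context.
# /var/solr/logs/solr.log is the service-install default; adjust as needed.
grep -n -A 6 "Caused by" /var/solr/logs/solr.log | tail -n 60
```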
>
> On 1 Feb 2017 6:28 p.m., "Joe Obernberger" <joseph.obernberger@gmail.com>
> wrote:
>
>> When a node fails it will leave behind write.lock files in HDFS.
>> These files have to be removed manually; otherwise the shards/replicas that
>> have write.lock files left behind will not start.  Since I can't tell which
>> physical node is hosting which shard/replica, I stop all the nodes, delete
>> all the write.lock files in HDFS, and restart.
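The cleanup described above can be scripted against HDFS. A hedged sketch, assuming the Solr index root is /solr (adjust the path for your layout) and that every Solr node has already been stopped:

```shell
# List leftover Lucene lock files under the Solr index root in HDFS.
# `hdfs dfs -ls -R` prints the path in the last field, which awk extracts.
hdfs dfs -ls -R /solr | awk '/write\.lock$/ {print $NF}'

# Remove them in batches. Only safe while ALL Solr nodes are stopped;
# deleting the lock out from under a live node would corrupt its index.
hdfs dfs -ls -R /solr | awk '/write\.lock$/ {print $NF}' \
  | xargs -r -n 50 hdfs dfs -rm
```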
>>
>> You are correct - only one replica is failing to start.  The other
>> replicas on the same physical node are coming up OK. Picture is worth a
>> thousand words so:
>> http://lovehorsepower.com/images/Cluster1.jpg
>>
>> Errors:
>> http://lovehorsepower.com/images/ClusterSolr2.jpg
>>
>> -Joe
>>
>> On 2/1/2017 1:20 PM, Alessandro Benedetti wrote:
>>
>>> Ok, it is clearer now.
>>> You have 9 Solr nodes running, one per physical machine,
>>> so each node hosts a number of cores (both replicas and leaders).
>>> When the node died, a lot of your indexes were corrupted.
>>> I still don't see why you restarted the other 8 working nodes (I was
>>> expecting you to restart only the failed one).
>>>
>>> When you mention that only one replica is failing, do you mean that the
>>> Solr node is up and running and only one Solr core (the replica of one
>>> shard) keeps failing?
>>> Or are all the local cores on that node failing to recover?
>>>
>>> Cheers
>>>
>>> On 1 Feb 2017 6:07 p.m., "Joe Obernberger" <joseph.obernberger@gmail.com>
>>> wrote:
>>>
>>> Thank you for the response.
>>> There are no virtual machines in the configuration.  The collection has 45
>>> shards with 3 replicas each spread across the 9 physical boxes; each box
>>> is
>>> running one copy of solr.  I've tried to restart just the one node after
>>> the other 8 (and all their shards/replicas) came up, but this one replica
>>> seems to be in perma-recovery.
>>>
>>> Shard Count: 45
>>> replicationFactor: 3
>>> maxShardsPerNode: 50
>>> router: compositeId
>>> autoAddReplicas: false
>>>
>>> SOLR_JAVA_MEM options are -Xms16g -Xmx32g
>>>
>>> _TUNE is:
>>> "-XX:+UseG1GC \
>>> -XX:MaxDirectMemorySize=8g \
>>> -XX:+PerfDisableSharedMem \
>>> -XX:+ParallelRefProcEnabled \
>>> -XX:G1HeapRegionSize=32m \
>>> -XX:MaxGCPauseMillis=500 \
>>> -XX:InitiatingHeapOccupancyPercent=75 \
>>> -XX:ParallelGCThreads=16 \
>>> -XX:+UseLargePages \
>>> -XX:-ResizePLAB \
>>> -XX:+AggressiveOpts"
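For reference, with the standard bin/solr start scripts these settings usually live in solr.in.sh; a sketch of the equivalent fragment (the variable names assume the stock Solr 6 startup scripts, and the options simply mirror those quoted above):

```shell
# solr.in.sh fragment (illustrative; reproduces the options quoted above)
SOLR_JAVA_MEM="-Xms16g -Xmx32g"
GC_TUNE="-XX:+UseG1GC \
  -XX:MaxDirectMemorySize=8g \
  -XX:+PerfDisableSharedMem \
  -XX:+ParallelRefProcEnabled \
  -XX:G1HeapRegionSize=32m \
  -XX:MaxGCPauseMillis=500 \
  -XX:InitiatingHeapOccupancyPercent=75 \
  -XX:ParallelGCThreads=16 \
  -XX:+UseLargePages \
  -XX:-ResizePLAB \
  -XX:+AggressiveOpts"
```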
>>>
>>> So far it has retried 22 times.  The cluster is accessible and OK, but I'm
>>> afraid to continue indexing data if this one node will never come back.
>>> Thanks for the help!
>>>
>>> -Joe
>>>
>>>
>>>
>>> On 2/1/2017 12:58 PM, alessandro.benedetti wrote:
>>>
>>>> Let me try to summarize.
>>>> How many virtual machines on top of the 9 physical ones?
>>>> How many Solr processes (replicas)?
>>>>
>>>> If one node was compromised,
>>>> I assume you have replicas as well, right?
>>>>
>>>> Can you explain your replica configuration in a little more detail?
>>>> Why did you have to stop all the nodes?
>>>>
>>>> I would have expected you to stop only the failing Solr node, clean up
>>>> its index, and restart it.
>>>> It would then automatically recover from the leader.
>>>>
>>>> Something is suspicious here, let us know!
>>>>
>>>> Cheers
>>>>
>>>>
>>>>
>>>> --
>>>> Alessandro Benedetti
>>>> Search Consultant, R&D Software Engineer, Director
>>>> Sease Ltd. - www.sease.io
>>>> --
>>>> View this message in context: http://lucene.472066.n3.nabble.com/Solr-6-3-0-recovery-failed-tp4318324p4318327.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>>

