lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Cluster with no overseer?
Date Wed, 22 May 2019 16:11:30 GMT
The ZK ensemble appears to be OK. It is the Solr-related stuff that is borked. There are 110
items in /overseer/collection-queue-work/, which doesn’t seem healthy.

If it is really hosed, I’ll shut down all the nodes, clean out the files in Zookeeper and
start over.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
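
For anyone checking the same symptoms, here is a minimal Python sketch of the ZooKeeper-side checks mentioned in this thread: whether a leader document exists under /overseer_elect, how many candidates sit in the election folder, and how many items are waiting in /overseer/collection-queue-work. It assumes a client object that exposes exists(path) and get_children(path), which is the shape of the kazoo library's KazooClient. The function name overseer_summary and the returned keys are made up for illustration, not part of Solr or ZooKeeper.

```python
# Paths as described in this thread; adjust if your cluster uses a chroot.
LEADER_PATH = "/overseer_elect/leader"
ELECTION_PATH = "/overseer_elect/election"
QUEUE_PATH = "/overseer/collection-queue-work"


def _count_children(zk, path):
    """Return the number of children under path, or 0 if the znode is missing."""
    if zk.exists(path) is None:
        return 0
    return len(zk.get_children(path))


def overseer_summary(zk):
    """Summarize overseer-related znodes (hypothetical helper).

    `zk` is assumed to behave like kazoo's KazooClient: exists() returns
    None for a missing znode, get_children() returns a list of child names.
    """
    return {
        "has_leader": zk.exists(LEADER_PATH) is not None,
        "election_candidates": _count_children(zk, ELECTION_PATH),
        "queued_items": _count_children(zk, QUEUE_PATH),
    }
```

With kazoo installed, something like zk = KazooClient(hosts="zk1:2181"); zk.start(); print(overseer_summary(zk)) should work, where the host string is a placeholder. A result with has_leader False and a large queued_items count would match the state described above.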

> On May 22, 2019, at 8:53 AM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> Good luck, this kind of assumes that your ZK ensemble is healthy of course...
> 
>> On May 22, 2019, at 8:23 AM, Walter Underwood <wunder@wunderwood.org> wrote:
>> 
>> Thanks, we’ll try that. Bouncing one Solr node doesn’t fix it, because we did a rolling restart yesterday.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On May 22, 2019, at 8:21 AM, Erick Erickson <erickerickson@gmail.com> wrote:
>>> 
>>> Walter:
>>> 
>>> I have no idea what the root cause is here, this really shouldn’t happen. But the Overseer role (and I’m assuming you’re talking Solr’s Overseer) is assigned similarly to a shard leader, the same election process happens. All the election nodes are ephemeral ZK nodes.
>>> 
>>> Solr’s Overseer is _not_ fixed to a particular Solr node, although you can assign a preferred role of Overseer in those (rare) cases where there are so many state changes for ZooKeeper that it’s advisable for them to run on a dedicated machine.
>>> 
>>> Overseer assignment is automatic. This should work;
>>> 1> shut everything down, Solr and Zookeeper
>>> 2> start your ZooKeepers and let them all get in sync with each other
>>> 3> start your Solr nodes. It might take 3 minutes or more to bring up the first Solr node, there’s up to a 180 second delay if leaders are not findable easily.
>>> 
>>> That should cause Solr to elect an overseer, probably the first Solr node to come up.
>>> 
>>> It _might_ work to bounce just one Solr node, seeing the Overseer election queue empty it may elect itself. That said, the overseer election queue won’t contain the rest of the Solr nodes like it should, so if that works you should probably bounce the rest of the Solr servers one by one to restore the proper election queue process.
>>> 
>>> Not a fix for the root cause of course, but should get things operating again. I’ll add that I haven’t seen this happen in the field to my recollection, if at all.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On May 21, 2019, at 9:04 PM, Will Martin <wmartin@urgent.ly> wrote:
>>>> 
>>>> Worked with Fusion and Zookeeper at GSA for 18 months: admin role.
>>>> 
>>>> Before blowing it away, you could try:
>>>> 
>>>> - id a candidate node, with a snapshot you just might think is old enough
>>>> to be robust.
>>>> - clean data for zk nodes otherwise.
>>>> - bring up the chosen node and wait for it to settle[wish i could remember
>>>> why i called what i saw that]
>>>> - bring up other nodes 1 at a time.  let each one fully sync to follower
>>>> of the new leader.
>>>> - they should each in turn request the snapshot from the lead. then you
>>>> have
>>>> 
>>>> : align your collections with the ensemble. and for the life of me i can't
>>>> remember there being anything particularly tricky about that with fusion,
>>>> which means I can't remember what I did... or have it doc'd at home. ;-)
>>>> 
>>>> 
>>>> Will Martin
>>>> DEVOPS ENGINEER
>>>> 540.454.9565
>>>> 
>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>> VIENNA, VA 22182
>>>> geturgently.com
>>>> 
>>>> 
>>>> On Tue, May 21, 2019 at 11:40 PM Walter Underwood <wunder@wunderwood.org>
>>>> wrote:
>>>> 
>>>>> Yes, please. I have the logs from each of the Zookeepers.
>>>>> 
>>>>> We are running 3.4.12.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wunder@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>>> On May 21, 2019, at 6:49 PM, Will Martin <wmartin@urgent.ly> wrote:
>>>>>> 
>>>>>> Walter. Can I cross-post to zk-dev?
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Will Martin
>>>>>> DEVOPS ENGINEER
>>>>>> 540.454.9565
>>>>>> 
>>>>>> 
>>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>>> VIENNA, VA 22182
>>>>>> geturgently.com <http://geturgently.com/>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On May 21, 2019, at 9:26 PM, Will Martin <wmartin@urgent.ly> wrote:
>>>>>>> 
>>>>>>> +1
>>>>>>> 
>>>>>>> Will Martin
>>>>>>> DEVOPS ENGINEER
>>>>>>> 540.454.9565
>>>>>>> 
>>>>>>> 8609 WESTWOOD CENTER DR, SUITE 475
>>>>>>> VIENNA, VA 22182
>>>>>>> geturgently.com <http://geturgently.com/>
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, May 21, 2019 at 7:39 PM Walter Underwood <wunder@wunderwood.org> wrote:
>>>>>>> ADDROLE times out after 180 seconds. This seems to be an unrecoverable state for the cluster, so that is a pretty serious bug.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wunder@wunderwood.org
>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>> 
>>>>>>>> On May 21, 2019, at 4:10 PM, Walter Underwood <wunder@wunderwood.org> wrote:
>>>>>>>> 
>>>>>>>> We have a 6.6.2 cluster in prod that appears to have no overseer. In /overseer_elect on ZK, there is an election folder, but no leader document. An OVERSEERSTATUS request fails with a timeout.
>>>>>>>> 
>>>>>>>> I’m going to try ADDROLE, but I’d be delighted to hear any other ideas. We’ve diverted all the traffic to the backing cluster, so we can blow this one away and rebuild.
>>>>>>>> 
>>>>>>>> Looking at the Zookeeper logs, I see a few instances of network failures across all three nodes.
>>>>>>>> 
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> wunder@wunderwood.org
>>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
> 
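
The two Collections API requests named in the thread, OVERSEERSTATUS and ADDROLE, can be sketched as plain URL builders. This is only an illustration: the base URL and node name are placeholders, the helper names are invented here, and the node value is assumed to use the host:port_solr form that Solr shows for live nodes. Sending the request (and choosing a timeout) is left to the caller.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Solr Collections API URL (hypothetical helper).

    base_url is the Solr root, e.g. "http://localhost:8983/solr".
    """
    query = {"action": action, "wt": "json", **params}
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(query)}"


def overseer_status_url(base_url):
    """URL for the OVERSEERSTATUS request."""
    return collections_api_url(base_url, "OVERSEERSTATUS")


def add_overseer_role_url(base_url, node):
    """URL for ADDROLE with role=overseer on the given node."""
    return collections_api_url(base_url, "ADDROLE", role="overseer", node=node)
```

For example, overseer_status_url("http://localhost:8983/solr") can be passed to curl or urllib.request.urlopen. As reported above, both requests timed out on the affected cluster, so a client-side timeout is worth setting.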

