lucene-solr-user mailing list archives

From Stephen Lewis Bianamara <stephen.bianam...@gmail.com>
Subject Re: Increasing Fault Tolerance of SOLR Cloud and Zookeeper
Date Fri, 14 Dec 2018 18:01:01 GMT
Thanks Erick, you've been very helpful. One other question I have: is it
reasonable to upgrade zookeeper in place under an existing SOLR install? I see
that SOLR-12727 appears to be verified against SOLR 7, modulo some test
issues. For SOLR 6.6, would upgrading zookeeper to that version be advisable,
or would you say it would be risky? Of course I'll stage it in a test
environment, but it's hard to get the full story from that alone...

Thanks!

On Thu, Dec 13, 2018 at 7:09 PM Erick Erickson <erickerickson@gmail.com>
wrote:

> bq. will the leader still report that there were two followers, even
> if one of them bounced
>
> I really can't say; I took the ZK folks at their word and upgraded.
>
> I should think that restarting your ZK nodes should reestablish that
> they are all talking to each other; you may need to restart your Solr
> instances to see it take effect.
>
> Sorry I can't be more help
> Erick
> On Thu, Dec 13, 2018 at 3:15 PM Stephen Lewis Bianamara
> <stephen.bianamara@gmail.com> wrote:
> >
> > Thanks for the help Erick.
> >
> > This is an external zookeeper, running on three separate AWS instances
> > separate from the instances hosting SOLR. I think I have some more
> insight
> > based on the bug you sent and some more log crawling.
> >
> > In October we had an instance retirement, wherein the instance was
> > automatically stopped and restarted. We verified on that instance that
> > echo ruok | nc localhost <<PORT>> returned imok. But I just looked at
> > that node with echo mntr | nc localhost <<PORT>>, and it appears to
> > have never served a request! The first time I ran it there was 1
> > packet sent/received, the next time 2 of each, the next time three...
> > It's reporting exactly the number of times I've run echo mntr | nc
> > localhost <<PORT>> :) The other two machines each show millions of
> > packets sent/received. It's quite weird, because the leader zookeeper
> > reports 2 synced followers now, yet if that's true I wonder why the
> > node has never served a request. Quite bizarre.
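[Editor's note: the per-node checks described above can be wrapped in a small
shell sketch. The hostnames and port in the usage comment are placeholders
standing in for the redacted values in this thread; `zk_stat` simply extracts
one key from `mntr` output.]

```shell
#!/bin/sh
# Extract a single statistic (e.g. zk_server_state) from "mntr" output on stdin.
zk_stat() {
  awk -v key="$1" '$1 == key { print $2 }'
}

# Query each ensemble member and report its role and packet counter.
check_ensemble() {
  port="$1"; shift
  for node in "$@"; do
    out=$(echo mntr | nc "$node" "$port")
    echo "$node: state=$(echo "$out" | zk_stat zk_server_state)" \
         "packets_received=$(echo "$out" | zk_stat zk_packets_received)"
  done
}

# Usage (against a real ensemble; placeholder hostnames):
# check_ensemble 2181 zookeeper-1.dns.domain.foo zookeeper-2.dns.domain.foo zookeeper-3.dns.domain.foo
```

A follower showing near-zero `zk_packets_received` while the leader claims it
is synced is exactly the inconsistency described above.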
> >
> > The three instances talk over internal DNS, and I'm not totally sure
> > whether the IP of the instance changed after its stop/start. I have
> > seen this both change and not change on AWS, and I'm not sure what
> > controls whether a stop/start changes the private IP. But I wonder if
> > we can rule anything out: in the case of the DNS bug SOLR-12727
> > <https://issues.apache.org/jira/browse/SOLR-12727>, will the leader
> > still report that there were two followers, even if one of them
> > bounced?
> >
> > Finally, this log appears on the zookeeper machine and looks like the
> > first sign of trouble: "Unexpected exception causing shutdown while
> > sock still open". I'm guessing that what's happened is that our zk
> > cluster has a failed quorum in some way, likely from SOLR-12727, but
> > the leader still thinks the other node is a follower. So I wonder,
> > what is the fix for this situation? Is it to stop and restart the
> > other two zookeeper processes one by one?
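[Editor's note: a one-by-one restart can be sketched as below. This is purely
illustrative; the `service zookeeper restart` command, ssh access, and port
2181 are assumptions, not details from the thread. Restarting followers first
and the leader last means leadership only has to move once.]

```shell
#!/bin/sh
# Given "host state" pairs on stdin (state from mntr's zk_server_state),
# print a restart order: followers first, leader last.
restart_order() {
  awk '$2 == "leader" { leader = $1; next } { print $1 } END { if (leader != "") print leader }'
}

# Restart each host in order, waiting for "imok" before continuing.
# The service name and port are assumptions; adjust to your setup.
rolling_restart() {
  port="$1"
  while read -r host; do
    ssh "$host" 'sudo service zookeeper restart'
    until [ "$(echo ruok | nc "$host" "$port")" = "imok" ]; do sleep 2; done
  done
}

# Usage:
# printf 'zk1 follower\nzk2 leader\nzk3 follower\n' | restart_order | rolling_restart 2181
```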
> >
> > Thanks a bunch,
> > Stephen
> >
> > On Thu, Dec 13, 2018 at 8:10 AM Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> > > Updates are disabled means that at least two of your three ZK nodes
> > > are unreachable, which is worrisome.
> > >
> > > First:
> > > That error is coming from Solr, but whether it's a Solr issue or a ZK
> > > issue is ambiguous. Might be explained if the ZK nodes are under heavy
> > > load. Question: Is this an external ZK ensemble? If so, what kind of
> > > load are those machines under? If you're using the embedded ZK, then
> > > stop-the-world GC could cause this.
> > >
> > > Second:
> > > Yeah, increasing timeouts is one of the tricks, but in either case
> > > you'd want to track down why the response is so slow. I don't have
> > > much confidence in this solution here, though; losing quorum points
> > > to something else as the culprit.
> > >
> > > Third:
> > > Not quite. The whole point of specifying the ensemble is that the ZK
> > > client is smart enough to continue to function if quorum is present.
> > > So it is _not_ the case that all the ZK instances need to be
> > > reachable.
> > >
> > > On that topic, did you bounce your ZK servers or change them in any
> > > other way? There's a known ZK issue when you reconfigure live ZK
> > > ensembles, see: https://issues.apache.org/jira/browse/SOLR-12727
> > >
> > > Fourth:
> > > See above.
> > >
> > > HTH,
> > > Erick
> > > On Wed, Dec 12, 2018 at 11:06 PM Stephen Lewis Bianamara
> > > <stephen.bianamara@gmail.com> wrote:
> > > >
> > > > Hello SOLR Community!
> > > >
> > > > I have a SOLR cluster which recently hit this error (full error
> > > > below): "Cannot talk to ZooKeeper - Updates are disabled." I'm
> > > > running solr 6.6.2 and zookeeper 3.4.6. The first time this
> > > > happened, we replaced a node within our cluster. The second time,
> > > > we followed the advice in this post
> > > > <http://lucene.472066.n3.nabble.com/Cannot-talk-to-ZooKeeper-Updates-are-disabled-Solr-6-3-0-td4311582.html>
> > > > and just restarted the SOLR service, which resolved the issue. I
> > > > traced this down (at least the second time) to this message: "WARN
> > > > (zkCallback-4-thread-31-processing-n:<<IP>>:<<PORT>>_solr) [   ]
> > > > o.a.s.c.c.ConnectionManager Watcher
> > > > org.apache.solr.common.cloud.ConnectionManager@4586a480 name:
> > > > ZooKeeperConnection Watcher:zookeeper-1.dns.domain.foo:1234,zookeeper-2.dns.domain.foo:1234,zookeeper-3.dns.domain.foo:1234
> > > > got event WatchedEvent state:Disconnected type:None path:null
> > > > path: null type: None".
> > > >
> > > > I'm wondering a few things. First, can you help me understand what
> > > > this error means in this context? Did the Zookeepers themselves
> > > > experience an issue, or just the SOLR node trying to talk to the
> > > > zookeepers? There was only one SOLR node affected, which was the
> > > > leader, and thus stopped all writes. Any way to trace this to a
> > > > specific resource limitation? Our ZK cluster looks to be rather
> > > > low utilization, but perhaps I'm missing something.
> > > >
> > > > The second: what steps can I take to make the SOLR-zookeeper
> > > > interaction more fault tolerant in general? It seems to me like we
> > > > might want to (a) increase the Zookeeper syncLimit to provide more
> > > > flexibility within the ZK quorum, but this would only help if the
> > > > issue was truly on the zk side. We could also increase the
> > > > tolerance on the SOLR side of things; would this be controlled via
> > > > zkClientTimeout? Any other thoughts?
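[Editor's note: both knobs mentioned above are plain config settings. A hedged
sketch follows; the values shown are illustrative, not recommendations.]

```shell
# zoo.cfg (ZooKeeper side): syncLimit is measured in ticks.
# With tickTime=2000 ms, syncLimit=5 lets a follower lag up to 10 s
# behind the leader before it is dropped.
#
#   tickTime=2000
#   initLimit=10
#   syncLimit=5

# solr.in.sh (Solr side): zkClientTimeout is the ZK session timeout in
# ms. Raising it tolerates longer pauses (e.g. GC) at the cost of
# slower failure detection.
#
#   ZK_CLIENT_TIMEOUT=30000
```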
> > > >
> > > > The third: is there some more fault tolerant ZK connection string
> > > > than listing out all three ZK nodes? I *think*, and please correct
> > > > me if I'm wrong, this will require all three ZK nodes to be
> > > > reporting as healthy for the SOLR node to consider the connection
> > > > healthy. Is that true? Or maybe including all three means only a
> > > > 2/3 quorum need be maintained. If connection health is based on
> > > > quorum, is moving a busy cluster to 5 nodes for a 3/5 quorum
> > > > desirable? Any other recommendations to make this healthier?
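[Editor's note: the quorum arithmetic behind that question is simple: an
ensemble of n nodes needs floor(n/2) + 1 members to form a quorum, so it
survives n minus that many failures. A tiny sketch:]

```shell
#!/bin/sh
# Minimum ensemble members needed for quorum: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Failures an n-node ensemble survives while keeping quorum.
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

# 3 nodes: quorum of 2, survives 1 failure.
# 5 nodes: quorum of 3, survives 2 failures.
# 4 nodes: quorum of 3, still survives only 1 -- even sizes buy nothing,
# which is why ensembles are sized 3, 5, 7, ...
```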
> > > >
> > > > Fourth, is any of the fault tolerance in this area improved in later
> > > > SOLR/Zookeeper versions?
> > > >
> > > > Finally, this looks to be connected to this Jira issue
> > > > <https://issues.apache.org/jira/browse/SOLR-3274>. The issue
> > > > doesn't appear to be very actionable unfortunately, but it appears
> > > > people have wondered about this before. Are there any plans in the
> > > > works to allow for recovery? We found our ZK cluster was healthy
> > > > and restarting the solr service fixed the issue, so it seems a
> > > > reasonable feature to add auto-recovery on the SOLR side when the
> > > > ZK cluster returns to healthy. Would you agree?
> > > >
> > > > Thanks for your help!!
> > > > Stephen
> > >
>
