lucene-solr-user mailing list archives

From James Keeney <nextves...@gmail.com>
Subject Re: Configuration of SOLR Cluster
Date Wed, 28 Feb 2018 13:54:24 GMT
Shawn -

Thanks again for all your help.


   - On the AWS side, I've confirmed that all members of the ensemble are
   able to talk to each other. The security groups are set up so that every
   member of the ensemble can receive all traffic from the other members.
   - The myid files are properly configured.
   - All nodes are open to all traffic from all other nodes.
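
For anyone who wants to repeat the connectivity check, here is a minimal Python sketch that probes a TCP port and reports whether the connection was accepted, actively refused, or timed out. The function names and the idea of looping over ports 2181, 2888 and 3888 are my own illustration, not something taken from the ZK tooling. One caveat, as far as I understand ZK: only the current leader listens on 2888, so "refused" on that port from a follower would be expected.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try one TCP connect and return 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote OS answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, which is typical of a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def probe_ensemble(hosts, ports=(2181, 2888, 3888), timeout=3.0):
    """Probe every (host, port) pair and return {(host, port): status}."""
    return {(h, p): probe(h, p, timeout) for h in hosts for p in ports}
```

Running `probe_ensemble([...])` from each node with the internal IPs of the other two (placeholders here, fill in your own) should show quickly whether a failure is a refusal or a silent drop.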


I took your suggestion and upgraded all of the nodes to 3.4.11. No change
in the behavior.  However, I used this change to test out what is
happening.

This is the sequence:


   - If I stop one node in the ensemble, the remaining 2 nodes properly
   call an election and establish that there are only 2 nodes and who the
   leader is. So far so good.
   - When I restart the disconnected node, though, it cannot reconnect with
   the ensemble. The ensemble rejects the connection request.
   - I then restart the remaining 2 nodes, and they are all able to connect
   again and the full ensemble is restored.
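
To watch each step of that sequence from the outside, one option is to ask every node for its role with the ZK "srvr" four-letter command on the client port. This is only a sketch under assumptions: the helper names are mine, port 2181 is taken from the zkHost string quoted below, and it assumes "srvr" is permitted by the server's four-letter-word settings.

```python
import socket


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line (e.g. 'leader', 'follower'), or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port=2181, timeout=3.0):
    """Send 'srvr' to a ZK server and return its reported mode, or None."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling `query_mode(ip)` for each of the three nodes after every stop or restart should show one leader and the rest followers when the ensemble is healthy; a node that has not rejoined would return None or fail to connect.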


I did notice one thing in the logs:

2018-02-28 13:21:58,932 [myid:1] - INFO  [/172.31.86.130:3888:QuorumCnxManager$Listener@743] - Received connection request /172.31.73.122:34804
2018-02-28 13:21:58,934 [myid:1] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Interrupting SendWorker
2018-02-28 13:21:58,934 [myid:1] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@941] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1094)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:929)

When the restarted node attempts to reconnect with the ensemble it looks
like it does so on a random port. Could it be that nodes in the ensemble
are rejecting the new request to rejoin because they are not listening on
that port? And why is it not requesting on 3888:2888? This is confusing to
me.
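
One possible reading of that log line, offered as a guess rather than a diagnosis: the thread name shows the listener on 172.31.86.130:3888, and the address after "Received connection request" may simply be the remote end of the socket, whose source port is picked by the OS for every outgoing TCP connection. The loopback sketch below illustrates that general TCP behaviour; it does not talk to ZK, and the function name is hypothetical.

```python
import socket


def demo_ephemeral_port():
    """Open a loopback connection and report the ports seen on each side."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # OS picks a free port; stands in for 3888
    listener.listen(1)
    listen_port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", listen_port))
    conn, peer = listener.accept()
    result = {
        "listen_port": listen_port,
        "client_dest_port": client.getpeername()[1],    # where the client connected
        "client_source_port": client.getsockname()[1],  # chosen by the OS
        "peer_port_seen_by_listener": peer[1],          # what a listener would log
    }
    conn.close()
    client.close()
    listener.close()
    return result
```

In the result, the destination port matches the listening port, while the port the listener sees for its peer is the client's OS-assigned source port. If the same applies here, 34804 would be the restarted node's source port for a connection made to 3888.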

I have attached a ZK log and a SOLR log. You can watch the whole
progression in the ZK log as it goes from happy to disconnected to trying
to reconnect to part of the ensemble when the other nodes are restarted.
Seems like ZK holds onto a state based on the original ensemble
interactions and that state prevents the node from rejoining the ensemble.
The state is then lost with the restart which allows the members to
re-establish connection and form the new ensemble.

You are right. This is definitely a ZK thing. Solr just observes that it
can no longer connect to one of the members of the ensemble in the list it
received. Solr appears to get progressively more upset about this until it
finally throws an exception and then goes back to complaining.

Let me know if you want me to take it over to the ZK mailing list.

Jim K.







On Tue, Feb 27, 2018 at 10:35 PM Shawn Heisey <elyograg@elyograg.org> wrote:

> On 2/27/2018 6:42 PM, James Keeney wrote:
> > -DzkHost=<ZK Host internal IP 1>:2181,<ZK Host internal IP 2>:2181,<ZK Host internal IP 1>:2181
>
> This looks correct, except that with AWS, I have no idea whether you
> need the internal IP addressing or the external IP addressing.  If all
> of the machines involved (both servers and clients) are able to
> communicate on the internal addresses, then that should be fine.  You
> might want to discuss the IP addressing with Amazon just to make sure.
>
> > java.net.ConnectException: Connection refused
>
> All of the logs you included look like they have this message --
> connection refused.  Normally this happens when the software isn't
> running -- the OS refuses connections when no software is listening on a
> TCP port.  Sometimes firewalls can refuse connections, but more commonly
> they just drop the traffic silently, and the system starting the
> connection has to wait for a timeout and never gets any kind of
> response.  In this case, there IS a response -- the connection is refused.
>
> It looks like you've pasted parts of the log, but I was actually hoping
> for entire logfiles, or at least entire sections of logfiles, to see
> errors in context with non-errors, and to be sure that nothing is lost,
> and that the formatting isn't destroyed by inclusion in an email
> message.  A paste website or a file sharing website is often the best
> way to share that kind of information.  If you need to redact
> information from the files, please do so in a way that preserves the
> ability to decipher the log.  For IP addresses, you could just redact
> the first two octets and leave the last two -- although if they are
> private addresses, you could leave them intact.
>
> My instinct here is to think there's either a fundamental networking
> issue (firewalls, other problems), or that there may be some kind of
> problem with ZK.  What version of ZK are you using on the servers, and
> what version of Solr is it?
>
> My instincts could be wrong because of a limited understanding of how ZK
> functions.
>
> My recommendation would be to run ZK version 3.4.11 on your servers.
> Each new release of ZK has a very impressive list of fixed bugs.  The
> client ZK version will depend on the Solr version, since the ZK jar is
> part of Solr.
>
> I looked at your ZK server config.  Your initLimit value is ten times
> what the default config for the embedded ZK in Solr is. Based on the
> comment in the embedded ZK config, that's probably not a problem, but I
> can't say for sure without more ZK knowledge.  The other parts of the
> config seem normal enough.
>
> Are you configuring the "myid" file in each ZK server's data directory,
> and does the value on each server correspond to the line in the ZK
> config for that server?  I assume you probably have this correct,
> because ZK probably wouldn't work at all if it wasn't right.
>
> I really don't know what might be going on.  Maybe with more complete
> logs I might spot something, but I don't know.
>
> Thanks,
> Shawn
>
> --
Jim Keeney
President, FitterWeb
E: jim@fitterweb.com
M: 703-568-5887

*FitterWeb Consulting*
*Are you lean and agile enough? *
