nifi-dev mailing list archives

From: Jeff <jtsw...@gmail.com>
Subject: Re: Zookeeper issues at initial Cluster startup
Date: Tue, 28 Feb 2017 19:31:31 GMT
Mark,

In my original response, I said that in zookeeper.properties, the server.N
properties should be set to the host:port of your ZK server, and that was
pretty ambiguous.  That port should not be the same as the clientPort.

As Bryan mentioned, with the default clientPort set to 2181, typically the
server.N properties are set to hostname:2888:3888.  In your case, you might
want to try something like the following, as long as these ports are not
currently in use:
server.1=<FQDN1>:2888:3888
server.2=<FQDN2>:2888:3888
server.3=<FQDN3>:2888:3888
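
To tie that back to the client side: assuming you keep the default
clientPort of 2181 in zookeeper.properties, the connect strings in
nifi.properties and state-management.xml would then look like the
following (just a sketch, substitute whatever clientPort you actually use):

nifi.zookeeper.connect.string=<FQDN1>:2181,<FQDN2>:2181,<FQDN3>:2181
<property name="Connect String"><FQDN1>:2181,<FQDN2>:2181,<FQDN3>:2181</property>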

Also, your settings for leader elections:
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=201

This waits until 201 election candidates have connected or 5 minutes have
elapsed, whichever comes first; with only 3 nodes, that means the election
will always run the full 5 minutes.  You might want to set the max
candidates to 3, since you have 3 nodes in your cluster.
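
In other words, something like this in nifi.properties:

nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=3

The wait time then only matters as a fallback if one of the three nodes
is slow to join.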

The contents of ./state/zookeeper look correct, so you should be okay there.
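
Just to spell out the mapping (assuming the server.N numbering above):

on FQDN1, ./state/zookeeper/myid contains 1 (matches server.1)
on FQDN2, ./state/zookeeper/myid contains 2 (matches server.2)
on FQDN3, ./state/zookeeper/myid contains 3 (matches server.3)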


On Tue, Feb 28, 2017 at 2:19 PM Bryan Bende <bbende@gmail.com> wrote:

> Mark,
>
> I am not totally sure, but there could be an issue with the ports in
> some of the connect strings.
>
> In zookeeper.properties there is an entry for clientPort, which
> defaults to 2181. The value of this property is what should be
> referenced in nifi.zookeeper.connect.string and in the
> state-management.xml Connect String, so if you left it alone then:
>
> FQDN1:2181,FQDN2:2181,FQDN3:2181
>
> In the server entries in zookeeper.properties, I believe they should
> be referencing different ports. For example, when using the default
> clientPort=2181 the server entries are typically like:
>
> server.1=localhost:2888:3888
>
> From the ZooKeeper docs the definition for these two ports is:
>
> "There are two port numbers nnnnn. The first followers use to connect
> to the leader, and the second is for leader election. The leader
> election port is only necessary if electionAlg is 1, 2, or 3
> (default). If electionAlg is 0, then the second port is not necessary.
> If you want to test multiple servers on a single machine, then
> different ports can be used for each server."
>
> In your configs it looks like the clientPort and the first port in the
> server string are both 11001, so I think making those different should
> do the trick.
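>
> As a sketch, you could keep the client port as it is and move the
> server entries to two different, unused ports; 11002 and 11003 below
> are just placeholder values, any free ports will do:
>
> server.1=FQDN1:11002:11003
> server.2=FQDN2:11002:11003
> server.3=FQDN3:11002:11003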
>
> -Bryan
>
>
> On Tue, Feb 28, 2017 at 1:58 PM, Mark Bean <mark.o.bean@gmail.com> wrote:
> > Relevant properties from nifi.properties:
> > nifi.state.management.provider.cluster=zk-provider
> > nifi.state.management.embedded.zookeeper.start=true
> > nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
> > nifi.cluster.protocol.heartbeat.interval=5 sec
> > nifi.cluster.protocol.is.secure=true
> > ## Security properties verified; they work for https in non-cluster
> > configuration
> >
> > nifi.cluster.is.node=true
> > nifi.cluster.node.address=FQDN1
> > nifi.cluster.node.protocol.port=9445
> > nifi.cluster.node.protocol.threads=10
> > nifi.cluster.node.event.history.size=25
> > nifi.cluster.node.connection.timeout=5 sec
> > nifi.cluster.node.read.timeout=5 sec
> > nifi.cluster.firewall.file=
> > nifi.cluster.flow.election.max.wait.time=5 mins
> > nifi.cluster.flow.election.max.candidates=201
> >
> > nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001
> > nifi.zookeeper.connect.timeout=3 secs
> > nifi.zookeeper.session.timeout=3 secs
> > nifi.zookeeper.root.node=/nifi/test-cluster
> >
> > zookeeper.properties all default except added these lines:
> > server.1=<FQDN1>:11001:11000
> > server.2=<FQDN2>:11001:11000
> > server.3=<FQDN3>:11001:11000
> >
> > state-management.xml all default except the following in <cluster-provider>:
> > <property name="Connect String">FQDN1:11001,FQDN2:11001,FQDN3:11001</property>
> > <property name="Root Node">/nifi/test-cluster</property>
> >
> > Also, the ./state/zookeeper/myid consists of only "1", "2", or "3"
> > depending on the server within the cluster. Is this correct?
> >
> >
> > On Tue, Feb 28, 2017 at 1:24 PM, Jeff <jtswork@gmail.com> wrote:
> >
> >> Hello Mark,
> >>
> >> Sorry to hear that you're having issues with getting your cluster up and
> >> running.  Could you provide the content of your nifi.properties file?
> >> Also, please check the Admin guide for ZK setup [1], particularly the
> >> Flow Election and Basic Cluster Setup sections.
> >>
> >> By default, nifi.properties uses a 5-minute election duration to elect
> >> the primary node.  However, it does not have a default number of
> >> candidates for the election, so typically it will take 5 minutes for
> >> that election process when you have a 3-node cluster.  You could try
> >> setting nifi.cluster.flow.election.max.candidates to 3, and restart the
> >> cluster, but based on the errors you're seeing, I think there may be
> >> some other issues.
> >>
> >> Some key properties to check:
> >>
> >> nifi.properties:
> >> nifi.state.management.embedded.zookeeper.start (true for embedded ZK,
> >> false or blank if you're using an external ZK)
> >> nifi.zookeeper.connect.string (set to the connect string for your ZK
> >> quorum, regardless of embedded or external ZK, e.g.
> >> host1:2181,host2:2181,host3:2181)
> >>
> >> zookeeper.properties:
> >> server.1 (server.1 through server.N, should be set to the hostname:port
> >> of each ZK server in your cluster, regardless of embedded or external ZK)
> >>
> >> state-management.xml, under cluster-provider element:
> >> <property name="Connect String"></property> (set to the connect string
> >> to access your ZK quorum, used by processors to store cluster-based state)
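> >>
> >> Putting those together, a rough sketch for a 3-node cluster with
> >> embedded ZK (assuming the default clientPort of 2181; adjust the
> >> hostnames and ports to your environment):
> >>
> >> nifi.properties:
> >> nifi.state.management.embedded.zookeeper.start=true
> >> nifi.zookeeper.connect.string=host1:2181,host2:2181,host3:2181
> >>
> >> zookeeper.properties:
> >> clientPort=2181
> >> server.1=host1:2888:3888
> >> server.2=host2:2888:3888
> >> server.3=host3:2888:3888
> >>
> >> state-management.xml (cluster-provider):
> >> <property name="Connect String">host1:2181,host2:2181,host3:2181</property>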
> >>
> >> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
> >>
> >> On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <mark.o.bean@gmail.com> wrote:
> >>
> >> > I am attempting to set up a new Cluster with 3 Nodes initially. Each
> >> > node is reporting zookeeper/curator errors, and the Cluster is not able
> >> > to connect the Nodes. The error is reported many times per second and
> >> > is continuous on all Nodes:
> >> >
> >> > 2017-02-28 14:22:53,515 ERROR [Curator-Framework-0]
> >> > o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
> >> > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> >> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> >> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> >> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> >> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> >> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> >> >         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> >> > 2017-02-28 14:22:53,516 ERROR [Curator-Framework-0]
> >> > o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
> >> > org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> >> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> >> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> >> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> >> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> >> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> >> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> >> >         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> >> >
> >> > While the above message was repeating in the log on one of the Nodes,
> >> > another Node's log was "stuck" for a period of time with the last
> >> > message being:
> >> >
> >> > INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties
> >> >
> >> > The next message to appear after nearly 6 minutes is:
> >> >
> >> > INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties from system properties and environment variables.
> >> >
> >> > The 6 minute delay seems curious.
> >> >
> >> > Then, the Node appears to start the zookeeper server but hits this error:
> >> >
> >> > ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler
> >> > Unexpected exception causing shutdown while sock still open
> >> > java.io.EOFException: null
> >> >         at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
> >> >         at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >> >         at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >> >         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >> >         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >> >         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >> >         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >> >
> >> > This is soon followed by the repeating errors shown above ("Background
> >> > operation retry gave up").
> >> >
> >> > It is as if the quorum vote does not succeed within a given timeframe
> >> > and then it stops trying. Note: on one attempt to start the Cluster
> >> > successfully, I removed all but one flow.xml.gz and cleared all
> >> > information in the ./state directory (except the ./state/zookeeper/myid
> >> > file).
> >> >
> >> > Thanks for assistance in understanding what zookeeper is doing (or not
> >> > doing) when starting up a new Cluster.
> >> >
> >> > -Mark
> >> >
> >>
>
