nifi-dev mailing list archives

From Jeff <jtsw...@gmail.com>
Subject Re: Zookeeper issues at initial Cluster startup
Date Tue, 28 Feb 2017 18:24:12 GMT
Hello Mark,

Sorry to hear that you're having issues with getting your cluster up and
running.  Could you provide the content of your nifi.properties file?
Also, please check the Admin guide for ZK setup [1], particularly the Flow
Election and Basic Cluster Setup sections.

By default, nifi.properties sets a 5-minute flow election wait time
(nifi.cluster.flow.election.max.wait.time), which is how long the cluster
waits before settling on the flow to use.  However, it does not set a
default number of candidates for the election, so the election will
typically take the full 5 minutes on a 3-node cluster.  You could try
setting nifi.cluster.flow.election.max.candidates to 3 and restarting the
cluster, but based on the errors you're seeing, I think there may be some
other issues.
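
As a rough sketch of that change, the two flow election entries in
nifi.properties might look like the following on each node. The wait time
shown is the default; the candidate count of 3 is an assumption that all
three nodes are expected to join before the election completes.

```properties
# nifi.properties (flow election settings, same on every node)
# Maximum time to wait before the cluster settles on a flow (default)
nifi.cluster.flow.election.max.wait.time=5 mins
# Blank by default; 3 assumes a 3-node cluster where all nodes take part
nifi.cluster.flow.election.max.candidates=3
```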

Some key properties to check:

nifi.properties:
nifi.state.management.embedded.zookeeper.start (true for embedded ZK, false
or blank if you're using an external ZK)
nifi.zookeeper.connect.string (set to the connect string for your ZK
quorum, regardless of embedded or external ZK, e.g.
host1:2181,host2:2181,host3:2181)

zookeeper.properties:
server.1 through server.N (one entry per ZK server in your cluster, each
set to that server's hostname and quorum/election ports, e.g.
host1:2888:3888, regardless of embedded or external ZK)

state-management.xml, under cluster-provider element:
<property name="Connect String"></property> (set to the connect string to
access your ZK quorum, used by processors to store cluster-based state)
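
Put together, a minimal sketch of these settings for a 3-node cluster
running embedded ZK might look like the following. The host names host1
through host3 are the placeholders from the example above, and ports 2888
and 3888 are the conventional ZooKeeper quorum and election ports, assumed
here for illustration; substitute your own values.

```properties
# nifi.properties (each node)
nifi.state.management.embedded.zookeeper.start=true
nifi.zookeeper.connect.string=host1:2181,host2:2181,host3:2181

# zookeeper.properties (same server list on each node)
# clientPort must match the port used in the connect string
clientPort=2181
server.1=host1:2888:3888
server.2=host2:2888:3888
server.3=host3:2888:3888
```

The Connect String property in state-management.xml would take the same
host list as nifi.zookeeper.connect.string. Each node's
./state/zookeeper/myid file should contain only the number of that node's
own server.N entry (1 on host1, 2 on host2, and so on).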

[1]
https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering

On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <mark.o.bean@gmail.com> wrote:

> I am attempting to set up a new Cluster with 3 Nodes initially. Each Node
> is reporting zookeeper/curator errors, and the Cluster is not able to
> connect the Nodes. The error is reported many times per second and is
> continuous on all Nodes:
>
> 2017-02-28 14:22:53,515 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> 2017-02-28 14:22:53,516 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
>         at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
>
> While the above message was repeating in the log on one of the Nodes,
> another Node's log was "stuck" for a period of time with the last message
> being:
>
> INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties
> from <path>/nifi.properties
>
> The next message to appear after nearly 6 minutes is:
>
> INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties
> from system properties and environment variables.
>
> The 6 minute delay seems curious.
>
> Then, the Node appears to start the zookeeper server but hits this error:
>
> ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler Unexpected exception causing shutdown while sock still open
> java.io.EOFException: null
>         at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
>         at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>
> This is soon followed by the repeating errors shown above ("Background
> operation retry gave up").
>
> It is as if the quorum vote does not succeed within a given timeframe and
> then it stops trying. Note: in one attempt to get the Cluster to start
> successfully, I removed all but one flow.xml.gz and cleared everything
> in the ./state directory (except the ./state/zookeeper/myid file).
>
> Thanks for assistance in understanding what zookeeper is doing (or not
> doing) when starting up a new Cluster.
>
> -Mark
>
