nifi-dev mailing list archives

From Mark Bean <mark.o.b...@gmail.com>
Subject Re: unstable cluster
Date Tue, 30 May 2017 12:15:00 GMT
We switched to an external ZooKeeper last Friday. Over the weekend, there were no
reports of SUSPENDED or RECONNECTED.

Are there plans to upgrade the embedded ZooKeeper to the latest version,
3.4.10?

Thanks,
Mark

On Thu, May 25, 2017 at 11:56 AM, Joe Witt <joe.witt@gmail.com> wrote:

> I looked at a secured cluster and the send times are routinely around 100ms,
> similar to yours.  I think what I was flagging as potentially
> interesting is not interesting at all.
>
> On Thu, May 25, 2017 at 11:34 AM, Joe Witt <joe.witt@gmail.com> wrote:
> > Ok.  Well, as a point of comparison, I'm looking at heartbeat logs from
> > another cluster and the times are consistently 1-3 millis for the
> > send.  Yours above show 100+ms typical with one north of 900ms.  Not
> > sure how relevant that is, but it is something I noticed.
> >
> > On Thu, May 25, 2017 at 11:29 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >> ping shows acceptably fast response time between servers, approximately
> >> 0.100-0.150 ms
> >>
> >>
> >> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <joe.witt@gmail.com> wrote:
> >>
> >>> Have you evaluated latency across the machines in your cluster?  I ask
> >>> because 122ms is pretty long and 917ms is very long.  Are these nodes
> >>> across a WAN link?
> >>>
> >>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >>> > Update: now all 5 nodes, regardless of ZK server, are indicating
> >>> > SUSPENDED -> RECONNECTED.
> >>> >
> >>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >>> >
> >>> >> I reduced the number of embedded ZooKeeper servers on the 5-Node NiFi
> >>> >> Cluster from 5 to 3. This has improved the situation. I do not see any of
> >>> >> the three Nodes which are also ZK servers disconnecting/reconnecting to the
> >>> >> cluster as before. However, the two Nodes which are not running ZK continue
> >>> >> to disconnect and reconnect. The following is taken from one of the non-ZK
> >>> >> Nodes. It's curious that some messages are issued twice from the same
> >>> >> thread, but reference a different object.
> >>> >>
> >>> >> nifi-app.log
> >>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
> >>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took 122 millis
> >>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took 122 millis
> >>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took 129 millis
> >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to SUSPENDED
> >>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to SUSPENDED
> >>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
> >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to RECONNECTED
> >>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to RECONNECTED
> >>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis
> >>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took 129 millis
> >>> >>
> >>> >> I will work on setting up an external ZK next, but would still like some
> >>> >> insight into what is being observed with the embedded ZK.
> >>> >>
> >>> >> Thanks,
> >>> >> Mark
> >>> >>
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >>> >>
> >>> >>> Yes, we are using the embedded ZK. We will try instantiating an external
> >>> >>> ZK and see if that resolves the problem.
> >>> >>>
> >>> >>> The load on the system is extremely small. Currently (as Nodes are
> >>> >>> disconnecting/reconnecting) all input ports to the flow are turned off. The
> >>> >>> only data in the flow is from a single GenerateFlowFile generating 5B every 30
> >>> >>> secs.
> >>> >>>
> >>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I will
> >>> >>> try reducing ZK to only 3 nodes. Then, I will try a 3-node external ZK.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Mark
> >>> >>>
> >>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <joe.witt@gmail.com> wrote:
> >>> >>>
> >>> >>>> Are you using the embedded ZooKeeper?  If yes, we recommend using an
> >>> >>>> external ZooKeeper.
> >>> >>>>
> >>> >>>> What type of load are the systems under when this occurs (cpu,
> >>> >>>> network, memory, disk io)? Under high load the default timeouts for
> >>> >>>> clustering are too aggressive.  You can relax these for higher load
> >>> >>>> clusters and should see good behavior.  Even if the system overall is
> >>> >>>> not under especially high load, lengthy and/or frequent garbage
> >>> >>>> collection pauses can have the same effect as high load as far as the
> >>> >>>> JVM is concerned.
> >>> >>>>
> >>> >>>> Thanks
> >>> >>>> Joe
> >>> >>>>
> >>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <mark.o.bean@gmail.com> wrote:
> >>> >>>> > We have a cluster which is showing signs of instability. The Primary Node
> >>> >>>> > and Coordinator are reassigned to different nodes every several minutes. I
> >>> >>>> > believe this is due to lack of heartbeat or other coordination. The
> >>> >>>> > following error occurs periodically in the nifi-app.log:
> >>> >>>> >
> >>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
> >>> >>>> > Unexpected Exception:
> >>> >>>> > java.nio.channels.CancelledKeyException: null
> >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> >>> >>>> >         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> >>> >>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
> >>> >>>> >         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
> >>> >>>> >         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
> >>> >>>> >         at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
> >>> >>>> >
> >>> >>>> > Apache NiFi 1.2.0
> >>> >>>> >
> >>> >>>> > Thoughts?
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>
> >>>
>
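
For anyone wanting to compare heartbeat send times across nodes as discussed in the thread, here is a minimal Python sketch that pulls the "send took N millis" values out of nifi-app.log lines and flags the slow ones. It assumes each log entry sits on a single line in the layout quoted above. The function names and the 500 ms cutoff are illustrative assumptions, not NiFi settings or defaults.

```python
import re

# Pattern for the heartbeat message layout quoted in this thread.
# Assumption: each log entry is on one line (mail line-wrapping removed).
HEARTBEAT_RE = re.compile(
    r"(?P<logged>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"Heartbeat created? at .*; send took (?P<millis>\d+) millis"
)

# Sample lines taken from the excerpt in this thread.
SAMPLE = [
    "2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] "
    "o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at "
    "2017-05-25 13:39:45,504 and sent to FQDN:PORT at "
    "2017-05-25 13:39:45,627; send took 122 millis",
    "2017-05-25 13:40:01,628 INFO [main-EventThread] "
    "o.a.c.f.state.ConnectionStateManager State change: SUSPENDED",
    "2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] "
    "o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at "
    "2017-05-25 13:40:01,632 and sent to FQDN:PORT at "
    "2017-05-25 13:40:02,550; send took 917 millis",
]


def heartbeat_send_times(lines):
    """Return (log timestamp, reported send time in ms) per heartbeat line."""
    results = []
    for line in lines:
        match = HEARTBEAT_RE.search(line)
        if match:
            results.append((match.group("logged"), int(match.group("millis"))))
    return results


def slow_heartbeats(lines, threshold_ms=500):
    """Return heartbeats whose reported send time exceeds threshold_ms.

    threshold_ms is an arbitrary illustrative cutoff, not a NiFi default.
    """
    return [(ts, ms) for ts, ms in heartbeat_send_times(lines) if ms > threshold_ms]


if __name__ == "__main__":
    for ts, ms in slow_heartbeats(SAMPLE):
        print(f"{ts}: send took {ms} ms")
```

To run it against a real log, pass an open file handle for nifi-app.log in place of `SAMPLE`. Lining up the flagged timestamps with the SUSPENDED/RECONNECTED entries shows whether slow sends coincide with the connection state changes.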
