nifi-dev mailing list archives

From Joe Skora <jsk...@gmail.com>
Subject Re: Questions about heterogeneous cluster and queue problem/bug/oddity in 1.0.0
Date Tue, 27 Sep 2016 13:08:08 GMT
Joe,

Thanks, your tuning comments all make sense.

If they didn't have similar CPU and RAM scales I probably would not
have tried it.  It's only been running a couple of days, but I've already
noticed some anecdotal performance differences.  For instance, the Linux
and OSX nodes appear to process more flow files than the Windows node; I
don't know if that's due to the SSDs or the different file systems.

The cluster runs better than I expected for non-server hardware.  I haven't
hammered it hard yet, but eventually I'll pull together some NiFi
performance stats and system/OS benchmark control numbers.

I had some bad hot spots in the flow, specifically before the ControlRate
and UpdateAttribute processors, so I tried splitting the flow with a
DistributeLoad to 3 instances of each and did the same for the highest
volume PutFile too.  That made a big difference and the hot spots were
gone.  Now there are several warm spots, but the queue sizes are much more
even across the graph and a big influx of files moves more steadily through
the graph instead of racing from one backup to the next.  Does that make
sense?
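
A minimal sketch of the split described above, assuming DistributeLoad
is set to a round-robin strategy (the message does not say which strategy
was used).  The function name and the 13-item input are hypothetical and
only illustrate how one burst gets spread over 3 parallel branches.

```python
from collections import deque


def distribute_round_robin(flowfiles, branches=3):
    """Assign each item to one of `branches` queues in rotation.

    Hypothetical helper: models a round-robin split, not NiFi's own code.
    """
    queues = [deque() for _ in range(branches)]
    for i, flowfile in enumerate(flowfiles):
        queues[i % branches].append(flowfile)
    return queues


# A burst of 13 items spread across 3 downstream instances.
queues = distribute_round_robin(range(13), branches=3)
sizes = [len(q) for q in queues]
print(sizes)
```

Each branch ends up with roughly a third of the burst, which is consistent
with the more even queue sizes mentioned above.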

Joe

On Tue, Sep 27, 2016 at 8:31 AM, Joe Witt <joe.witt@gmail.com> wrote:

> JoeS
>
> I think you are seeing a queue bug that has been corrected or reported
> on the 1.x line.
>
> As for the frankencluster concept I think it is generally fair game.
> There are a number of design reasons, most notably back pressure, that
> make this approach feasible.  So the big ticket items to consider are
> things like
>
> CPU
> Since the model of NiFi is that basically all processes/tasks are
> eligible to run on all nodes, and the number of threads and tasks
> configured per controller and component is applied to all nodes,
> this could be problematic when there is a substantive imbalance of
> power across the various systems.  If this were important to
> improve we could allow node-local overrides of max controller threads.
> That helps a bit but doesn't really solve it.  Again back pressure is
> probably the most effective.  There are probably a number of things we
> could do here if needed.
>
> Disk
> We have to consider the speed, congestion, and storage available on
> the disk(s) and how they're partitioned and such for our various
> repositories.  Again back pressure is one of the more effective
> mechanisms here because it is all about doing as much as you can which
> means other nodes should be able to take on more/less.  Fortunately
> the configuration of the repositories and such here are node-local so
> we can have pretty considerable variety here and things work pretty
> well.
>
> Network
> Back pressure for the win.  Though significant imbalances could lead
> to significant congestion which could cause inefficiencies in general
> so would need to be careful.  That scenario would require wildly
> imbalanced node capabilities and very high rate flows most likely.
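
A toy model of the back pressure idea above, offered only as a sketch.
Everything in it is an assumption made for illustration: the function
name, the per-tick processing rates, the threshold of 10 queued items per
node, and a source that hands out data only while a node's queue is below
that threshold.  It is not how NiFi implements back pressure.

```python
def simulate_back_pressure(rates, threshold, incoming, ticks):
    """Simulate nodes with different speeds pulling from one source.

    rates: items each node can process per tick (hypothetical values).
    threshold: max items allowed to queue on a node before the source
               stops feeding it (the back pressure limit).
    """
    queued = [0] * len(rates)
    done = [0] * len(rates)
    remaining = incoming
    for _ in range(ticks):
        # The source fills each node only up to its threshold.
        for n in range(len(rates)):
            take = min(threshold - queued[n], remaining)
            queued[n] += take
            remaining -= take
        # Each node then processes as much as its rate allows.
        for n, rate in enumerate(rates):
            worked = min(rate, queued[n])
            queued[n] -= worked
            done[n] += worked
    return done, queued, remaining


# One fast node (5 per tick) and one slow node (2 per tick).
done, queued, remaining = simulate_back_pressure(
    rates=[5, 2], threshold=10, incoming=100, ticks=10
)
print(done, queued, remaining)
```

In this model the slower node's queue stays bounded at the threshold, and
the faster node ends up doing more of the work without any per-node tuning.
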
>
> Memory
> JVM Heap size variability and/or off heap memory differences could
> cause some nodes to behave wildly differently than others in ways that
> back pressure will not necessarily solve.  For instance a node with
> too low heap size for the types of processes in the flow could yield
> order(s) of magnitude lower performance than another node.  We should
> do more for these things.  Users should not have to configure things
> like swapping thresholds for instance.  We should at runtime determine
> and tune those values.  It is simply too hard to find a good magic
> number that predicts the likely number of flow file attributes and
> size that might be needed and those can have a substantial impact on
> heap usage.  Right now we treat swapping on a per queue basis though
> it is configured globally.  If you have say just 100 queues each
> holding in memory 1000 flowfiles you have all the attributes of those
> 100,000 flowfiles in memory.  If each flow file took up just 1KB of
> memory we're talking 100+MB.  Perhaps a slightly odd example but users
> aren't going to go through and think about every queue and the optimal
> global swapping setting.  Though it is an important number.  The
> system should be watching them all and doing this automatically.  That
> could help quite a lot.  We may also end up needing to not even have
> flowfile attributes held in memory though supporting this would
> require API changes to ensure they're only accessed in stream friendly
> ways.  Doing this for all uses of EL is probably pretty
> straightforward but all the direct attribute map accesses would need
> consideration.
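
The heap arithmetic above can be checked with a short sketch.  The
function name is hypothetical, and 1KB is taken as 1024 bytes.  The second
call assumes a per-queue swap threshold of 20000 flowfiles, which is an
assumed value for illustration and not a figure from this thread.

```python
def attribute_heap_bytes(queue_count, flowfiles_per_queue, bytes_per_flowfile):
    """Estimate heap held by in-memory flowfile attributes.

    Hypothetical helper: a simple product, ignoring JVM object overhead.
    """
    return queue_count * flowfiles_per_queue * bytes_per_flowfile


# The example from the message: 100 queues x 1000 flowfiles x 1KB each.
example_bytes = attribute_heap_bytes(100, 1000, 1024)
print(example_bytes / 1_000_000, "MB")

# Worst case if every queue filled to an assumed swap threshold of 20000.
worst_case_bytes = attribute_heap_bytes(100, 20000, 1024)
print(worst_case_bytes / 1_000_000, "MB")
```

The first figure lines up with the "100+MB" estimate; the second shows how
quickly a single global threshold multiplies across many queues.
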
>
> ...And we also need to think through things like
>
> OS Differences in accessing resources
> We generally follow "Pure Java (tm)" practices where possible.  So
> this helps a lot.  But still things like accessing specific file paths
> as might be needed in flow configurations themselves (GetFile/PutFile
> for example) could be tricky (but doable).
>
> The protocols used to source data matter a lot
> With all this talk of back pressure keep in mind that how data gets
> into NiFi becomes really critical in these clusters.  If you use
> protocols which do not afford fault tolerance and load balancing then
> things are not great.  So protocols which have queuing semantics or
> feedback mechanisms or let NiFi as the consumer control things will
> work out well.  Some portions of JMS are good for this.  Kafka is good
> for this.  NiFi's own site-to-site is good for this.
>
> The frankencluster testing is a valuable way to force and think
> through interesting issues. Maybe the frankencluster as you have it
> isn't realistic but it still exposes the concepts that need to be
> thought through for cases that definitely are.
>
> Thanks
> Joe
>
> On Tue, Sep 27, 2016 at 7:37 AM, Joe Skora <jskora@gmail.com> wrote:
> > The images just show what the text described, 13 files queued, EmptyQueue
> > returns 0 of 13 removed, and ListQueue returns the queue has no flowfiles.
> >
> > There were 13 files of 1k sitting in a queue between a SegmentContent and
> > ControlRate.  After I sent that email I had to stop/start the processors a
> > couple of times for other things and somewhere in the midst of that the
> > queue cleared.
> >
> >
> >
> > On Mon, Sep 26, 2016 at 11:05 PM, Peter Wicks (pwicks) <pwicks@micron.com>
> > wrote:
> >
> >> Joe,
> >>
> >> I didn’t get the images (might just be my exchange server). How many
> >> files are in the queue? (exact count please)
> >>
> >> --Peter
> >>
> >> From: Joe Skora [mailto:jskora@gmail.com]
> >> Sent: Monday, September 26, 2016 8:20 PM
> >> To: dev@nifi.apache.org
> >> Subject: Questions about heterogeneous cluster and queue
> >> problem/bug/oddity in 1.0.0
> >>
> >> I have a 3 node test franken-cluster that I'm abusing for the sake of
> >> learning.  The systems run Ubuntu 15.04, OS X 10.11.6, and Windows 10
> >> and though far comparable each has quad-core i7 between 2.5 and 3.5 GHz
> >> and 16GB of RAM.  Two have SSDs and the third has a 7200RPM SATA III
> >> drive.
> >>
> >> 1) Is there any reason mixing operating systems within the cluster
> >> would be a bad idea?  Once configured it seems to run ok.
> >> 2) Will performance disparities affect reliability or performance
> >> within the cluster?
> >> 3) Are there ways to configure disparate systems such that they can all
> >> perform at peak?
> >>
> >> The bug or issue I have run into is a queue showing files that can't be
> >> removed or listed.  Screen shots attached below.  I don't know if it's a
> >> mixed-OS issue, something I did while torturing the systems (all stayed
> >> up, this time), or just a weird anomaly.
> >>
> >> Regards,
> >> Joe
> >>
> >> Trying to empty queue seen in background
> >> [Inline image 1]
> >>
> >> but the flowfiles cannot be deleted.
> >> [Inline image 2]
> >>
> >> But try to list them and it says there are no files in the queue?
> >> [Inline image 3]
> >>
>
