nifi-commits mailing list archives

From "Andrew Purtell (JIRA)" <>
Subject [jira] [Commented] (NIFI-337) Automated cluster load balancing
Date Mon, 09 Feb 2015 23:35:35 GMT


Andrew Purtell commented on NIFI-337:

So the key will be how to devise and implement an algorithm or approach that spreads that
load intelligently, so data doesn't just bounce back and forth.  If anyone knows of good
papers, similar systems, or approaches they can describe for how to think through this, that
would be great.  Things we'll have to think about here that come to mind:

- When to start spreading the load (at what load factor should we start spreading work across
the cluster)

- Whether it should auto-spread by default, with the user able to opt out in certain cases,
or whether it should not spread by default, with the user opting in

- What the criteria are by which we should let a user control how data is partitioned (some
key, round robin, etc.).  How to rebalance/re-assign partitions if a node dies or comes
back online
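
To make the partitioning choices above concrete, here is a minimal sketch of a threshold-based spread decision plus a round-robin pick over the least-loaded peers. All names here (`Node`, `LOAD_FACTOR_THRESHOLD`, `should_spread`) are hypothetical illustrations, not NiFi APIs, and the 0.8 threshold is an arbitrary assumption:

```python
# Hypothetical sketch: when to start spreading, and where to send work.
from dataclasses import dataclass
from itertools import cycle

LOAD_FACTOR_THRESHOLD = 0.8  # assumed: start spreading once a node is 80% full

@dataclass
class Node:
    name: str
    queued: int    # flowfiles currently queued on this node
    capacity: int  # queue size at which back pressure engages

    @property
    def load_factor(self) -> float:
        return self.queued / self.capacity

def should_spread(node: Node) -> bool:
    """Only start redistributing once the local node is under pressure,
    so data doesn't bounce back and forth at low load."""
    return node.load_factor >= LOAD_FACTOR_THRESHOLD

def pick_targets(nodes: list[Node]):
    """Round-robin over peers that are not themselves under pressure,
    least-loaded first."""
    eligible = sorted((n for n in nodes if not should_spread(n)),
                      key=lambda n: n.load_factor)
    return cycle(eligible)
```

The point of the threshold is hysteresis: below it, data stays where it landed, which avoids the back-and-forth shuffling mentioned above.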

There are 'counter cases' too that we must keep in mind, such as aggregation or bin packing
grouped by some key.  In those cases all data for a group would eventually need to be merged
together, and thus all of it must be accessible in one place. Whether that means we direct all
data to a single node or whether we enable cross-cluster data addressing is also a topic there.
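
For the aggregation/bin-packing counter case, one option is stable hash-partitioning by key, so every record in a group lands on the same node and a downstream merge sees the whole group. This is a generic sketch (the function name `node_for_key` is made up for illustration), not how NiFi does or would do it:

```python
# Hypothetical sketch: stable key-based partitioning across cluster nodes.
import hashlib

def node_for_key(key: str, nodes: list[str]) -> str:
    """Every record with the same key maps to the same node, so an
    aggregation or bin-packing step grouped by that key can merge locally."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

Note the trade-off this sketch ignores: simple modulo hashing reshuffles most keys when the node list changes, which is exactly the rebalance-on-node-death question raised above.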

> Automated cluster load balancing
> --------------------------------
>                 Key: NIFI-337
>                 URL:
>             Project: Apache NiFi
>          Issue Type: New Feature
>            Reporter: Andrew Purtell
> On dev@ in response to an inquiry, from [~joewitt]:
> {quote}
> The processors themselves are available and ready to run on all nodes at all times. 
It's really just a question of whether they have data to run on.  We have always taken the
view that 'if you want scalable dataflow' use scalable interfaces. And I think that is the
way to go in every case you can pull it off. That generally meant one should use datasources
which offer queueing semantics where multiple independent nodes can pull from the queue with
'at-least-once' guarantees.  In addition each node has back pressure so if it falls behind
it slows its rate of pickup which means other nodes in the cluster can pick up the slack. 
This has worked extremely well.
> That said, I recognize that it isn't always possible to use scalable interfaces and given
enough non-scalable datasources the cluster could become out of balance.  So this certainly
seems like a good / valuable / fun / non-trivial problem to tackle.  If we allow connections
between processors to be auto-balanced then it will make for a pretty smooth experience as
users won't really have to think too much about it.
> {quote}
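
The pull-with-back-pressure pattern described in the quote can be sketched as follows. This is a toy simulation under assumed names (`PullingNode`, `batch_size`), not NiFi code: each node pulls from a shared queue, but throttles its own pickup as its local queue fills, leaving work for less-loaded peers:

```python
# Hypothetical sketch: nodes pull from a shared queue; back pressure
# shrinks a busy node's batch size so idle peers pick up the slack.
from collections import deque

class PullingNode:
    def __init__(self, name: str, local_capacity: int):
        self.name = name
        self.local = deque()          # this node's local work queue
        self.capacity = local_capacity

    def batch_size(self) -> int:
        """Pull less as the local queue fills; zero when it is full."""
        return max(self.capacity - len(self.local), 0)

    def pull(self, shared: deque) -> int:
        """Take up to batch_size() items from the shared queue."""
        n = min(self.batch_size(), len(shared))
        for _ in range(n):
            self.local.append(shared.popleft())
        return n
```

A node that has fallen behind (nearly full local queue) naturally pulls a small batch, while an idle node drains the remainder, which is the self-balancing behavior the quote describes for scalable, queue-backed datasources.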

This message was sent by Atlassian JIRA
