From issues-return-86851-archive-asf-public=cust-asf.ponee.io@nifi.apache.org  Thu Oct 17 17:19:02 2019
Return-Path: <issues-return-86851-archive-asf-public=cust-asf.ponee.io@nifi.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 5C5DE180657
	for <archive-asf-public@cust-asf.ponee.io>; Thu, 17 Oct 2019 19:19:02 +0200 (CEST)
Received: (qmail 73974 invoked by uid 500); 17 Oct 2019 17:19:01 -0000
Mailing-List: contact issues-help@nifi.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:issues-help@nifi.apache.org>
List-Unsubscribe: <mailto:issues-unsubscribe@nifi.apache.org>
List-Post: <mailto:issues@nifi.apache.org>
List-Id: <issues.nifi.apache.org>
Reply-To: dev@nifi.apache.org
Delivered-To: mailing list issues@nifi.apache.org
Received: (qmail 73963 invoked by uid 99); 17 Oct 2019 17:19:01 -0000
Received: from mailrelay1-us-west.apache.org (HELO mailrelay1-us-west.apache.org) (209.188.14.139)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Oct 2019 17:19:01 +0000
Received: from jira-he-de.apache.org (static.172.67.40.188.clients.your-server.de [188.40.67.172])
	by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id E11BBE2F4F
	for <issues@nifi.apache.org>; Thu, 17 Oct 2019 17:19:00 +0000 (UTC)
Received: from jira-he-de.apache.org (localhost.localdomain [127.0.0.1])
	by jira-he-de.apache.org (ASF Mail Server at jira-he-de.apache.org) with ESMTP id 611AD7803DB
	for <issues@nifi.apache.org>; Thu, 17 Oct 2019 17:19:00 +0000 (UTC)
Date: Thu, 17 Oct 2019 17:19:00 +0000 (UTC)
From: "Mark Payne (Jira)" <jira@apache.org>
To: issues@nifi.apache.org
Message-ID: <JIRA.13262899.1571329471000.83395.1571332740393@Atlassian.JIRA>
In-Reply-To: <JIRA.13262899.1571329471000@Atlassian.JIRA>
References: <JIRA.13262899.1571329471000@Atlassian.JIRA> <JIRA.13262899.1571329471249@jira-he-de>
Subject: [jira] [Commented] (NIFI-6787) Load balanced connection can hold up
 for whole cluster when one node slows down
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/NIFI-6787?page=3Dcom.atlassian.=
jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D16953=
938#comment-16953938 ]=20

Mark Payne commented on NIFI-6787:
----------------------------------

Hey [~tpalfy]=C2=A0this is certainly something that is important to underst=
and and ensure that it is working optimally.

The statement "when checking if it's full or not, it checks the=C2=A0_total=
_=C2=A0size (which is across the whole cluster) and compares it to the max =
(which is scoped to the current node)" though is not exactly accurate. The =
difference here between looking at the total queue size vs. local queue siz=
e is important, and perhaps the naming convention used in the code is a bit=
 confusing, so let me try to explain that better.

The queue is divided into N "partitions" where N =3D the number of nodes in=
 the cluster (+1 for the Rebalancing Partition, but that doesn't really com=
e into play here). This is a "local partition" and N-1 "Remote partitions".=
 A "Remote Partition" is NOT a partition that lives on another node, but ra=
ther a partition that is holding FlowFiles waiting to be pushed to another =
node.

A Processor (or port or funnel) can only pull data from the Local Partition=
. It does not pull a FlowFile that is waiting to be pushed to another node =
in the cluster. So this is why we check if active queue is empty or not on =
the local partition.

But when we consider whether or not a queue is full, in order to provide ba=
ckpressure, what we are looking at is not the FlowFiles across the entire c=
luster - we are looking at the FlowFiles in the Local Partition + the FlowF=
iles on this node that are waiting to go to other nodes in the cluster. It'=
s important to take this into account, because if we look only at whether o=
r not the Local Partition is full, we can get into a situation where backpr=
essure never is appropriately applied. Consider, for example, if the Partit=
ioner is set to Load Balance to a single Node in the cluster. Then, on most=
 nodes, all data ends up in the RemoteQueuePartition, not in the Local Part=
ition. We still want to have backpressure applied in this instance.

If there is one slow node in the cluster, all other nodes should still beha=
ve nicely with one another, though they will of course be limited in how we=
ll they can distributed data between themselves and the slow node. I wonder=
 if perhaps what you are running into can be described by NIFI-6353, NIFI-6=
517, or NIFI-6736? If not, can you attach a template to help illustrate the=
 issue?

> Load balanced connection can hold up for whole cluster when one node slow=
s down
> -------------------------------------------------------------------------=
------
>
>                 Key: NIFI-6787
>                 URL: https://issues.apache.org/jira/browse/NIFI-6787
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Tamas Palfy
>            Assignee: Tamas Palfy
>            Priority: Major
>
> A slow processor on one node in a cluster can slow down the same processo=
r on the other nodes when load balanced incoming connection is used:
> When a processors decides wether it can pass along a flowfile, it checks =
whether the outgoing connection is full or not (if full, it yields).
> Similarly when the receiving processor is to be scheduled the framework c=
hecks if the incoming connection is empty (if empty, no reason to call {{on=
Trigger}}).
> The problem is that when checking if it's full or not, it checks the _tot=
al_ size (which is across the whole cluster) and compares it to the max (wh=
ich is scoped to the current node).
> The empty check is (correctly) done on the local partition only.
> This can lead to the case where the slow node fills up its queue while th=
e faster ones empty theirs.
> Once the slow node has a full queue, the fast ones stop receiving new inp=
ut and thus stop working after their queues get emptied.
> The issue is probably the fact that {{SocketLoadBalancedFlowFileQueue.isF=
ull}} actually calls to {{AbstractFlowFileQueue.isFull}} which checks {{siz=
e()}}, but that returns the _total_ size. (The empty check looks fine but f=
or reference it is done via {{SocketLoadBalancedFlowFileQueue.isActiveQueue=
Empty}}).
> {code:title=3DAbstractFlowFileQueue.java|borderStyle=3Dsolid}
> ...
>     private MaxQueueSize getMaxQueueSize() {
>         return maxQueueSize.get();
>     }
>     @Override
>     public boolean isFull() {
>         return isFull(size());
>     }
>     protected boolean isFull(final QueueSize queueSize) {
>         final MaxQueueSize maxSize =3D getMaxQueueSize();
>         // Check if max size is set
>         if (maxSize.getMaxBytes() <=3D 0 && maxSize.getMaxCount() <=3D 0)=
 {
>             return false;
>         }
>         if (maxSize.getMaxCount() > 0 && queueSize.getObjectCount() >=3D =
maxSize.getMaxCount()) {
>             return true;
>         }
>         if (maxSize.getMaxBytes() > 0 && queueSize.getByteCount() >=3D ma=
xSize.getMaxBytes()) {
>             return true;
>         }
>         return false;
>     }
> ...
> {code}
> {code:title=3DSocketLoadBalancedFlowFileQueue.java|borderStyle=3Dsolid}
> ...
>     @Override
>     public QueueSize size() {
>         return totalSize.get();
>     }
> ...
>     @Override
>     public boolean isActiveQueueEmpty() {
>         return localPartition.isActiveQueueEmpty();
>     }
> ...
> {code}


--
This message was sent by Atlassian Jira
(v8.3.4#803005)