Mailing-List: contact jira-help@kafka.apache.org; run by ezmlm
Precedence: bulk
Reply-To: jira@kafka.apache.org
Date: Wed, 6 Sep 2017 18:38:01 +0000 (UTC)
From: "Ewen Cheslack-Postava (JIRA)" <jira@apache.org>
To: jira@kafka.apache.org
Message-ID: <JIRA.13095075.1502903861000.42304.1504723081060@Atlassian.JIRA>
In-Reply-To: <JIRA.13095075.1502903861000@Atlassian.JIRA>
References: <JIRA.13095075.1502903861000@Atlassian.JIRA> <JIRA.13095075.1502903861846@jira-lw-us.apache.org>
Subject: [jira] [Commented] (KAFKA-5741) Prioritize threads in Connect
 distributed worker process
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 06 Sep 2017 18:38:15 -0000


    [ https://issues.apache.org/jira/browse/KAFKA-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155857#comment-16155857 ] 

Ewen Cheslack-Postava commented on KAFKA-5741:
----------------------------------------------

It would be good to have clear indications that this is actually a problem in practice and that other threads starving the herder thread caused it to rebalance. First, heartbeating actually happens in a background thread, so you'd have to starve that thread as well for the session timeout to expire. And the actual processing done in the thread is very minimal, so you'd have to completely starve that thread for a long time -- it's much more likely that something like waiting for other threads to flush data during a rebalance is what causes it to fall out of the group.

I'm also skeptical of the prioritization because, to me, if this really occurred for this reason, it would suggest that the hardware is just underprovisioned for the workload. If there really is that much resource contention, prioritizing the DistributedHerder thread would probably just end up starving other threads, so the connectors wouldn't really be functioning correctly anyway.

> Prioritize threads in Connect distributed worker process
> --------------------------------------------------------
>
>                 Key: KAFKA-5741
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5741
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 0.11.0.0
>            Reporter: Randall Hauch
>            Priority: Critical
>
> Connect's distributed worker process uses the {{DistributedHerder}} to perform all administrative operations, including: starting, stopping, pausing, resuming, reconfiguring connectors; rebalancing; etc. The {{DistributedHerder}} uses a single threaded executor service to do all this work and to do it sequentially. If this thread gets preempted for any reason (e.g., connector tasks are bogging down the process, DoS, etc.), then the herder's membership in the group may be dropped, causing a rebalance.
> This herder thread should be run at a much higher priority than all of the other threads in the system.
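
For illustration only, the proposal in the issue description could be sketched roughly as follows: a single-threaded executor whose one thread is created with a raised priority. This is not the actual DistributedHerder code; the class name HerderPrioritySketch and the helper highPriorityFactory are hypothetical, and Thread priority is only a hint that the JVM or OS scheduler may not honor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class HerderPrioritySketch {

    // Hypothetical helper: builds threads with the given name and the
    // maximum Java priority. The priority is a scheduling hint only.
    static ThreadFactory highPriorityFactory(String name) {
        return runnable -> {
            Thread t = new Thread(runnable, name);
            t.setPriority(Thread.MAX_PRIORITY);
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        // One thread runs all submitted work sequentially, mirroring the
        // single-threaded executor described in the issue.
        ExecutorService herderExecutor =
                Executors.newSingleThreadExecutor(highPriorityFactory("DistributedHerder"));

        // Report the priority the worker thread actually ended up with.
        int priority = herderExecutor
                .submit(() -> Thread.currentThread().getPriority())
                .get();
        System.out.println("herder thread priority: " + priority);

        herderExecutor.shutdown();
        herderExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Even with a change like this, the concern raised in the comment above still applies: if the process is genuinely starved for CPU, raising one thread's priority only shifts the starvation onto the task threads.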

