Date: Wed, 6 Sep 2017 18:38:01 +0000 (UTC)
From: "Ewen Cheslack-Postava (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Commented] (KAFKA-5741) Prioritize threads in Connect distributed worker process

    [ https://issues.apache.org/jira/browse/KAFKA-5741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155857#comment-16155857 ]

Ewen Cheslack-Postava commented on KAFKA-5741:
----------------------------------------------

It would be good to have clear indications this is actually a problem in practice and that other threads starving the herder thread caused it to rebalance.

First, heartbeating actually happens in a background thread, so you'd have to starve that thread as well for the session timeout. And the actual processing done in the thread is very minimal, so you'd have to completely starve that thread for a long time -- it's much more likely that things like waiting for other threads to flush data during a rebalance is what causes it to fall out of the group.

I'm also skeptical of the prioritization because to me, if this really occurred for this reason, it would suggest that the hardware is just underprovisioned for the workload. Prioritizing the DistributedHerder thread would probably just end up starving other threads if there really is that much resource contention, and so the connectors won't even really be functioning correctly anyway...

> Prioritize threads in Connect distributed worker process
> --------------------------------------------------------
>
>                 Key: KAFKA-5741
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5741
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 0.11.0.0
>            Reporter: Randall Hauch
>            Priority: Critical
>
> Connect's distributed worker process uses the {{DistributedHerder}} to perform all administrative operations, including: starting, stopping, pausing, resuming, reconfiguring connectors; rebalancing; etc. The {{DistributedHerder}} uses a single threaded executor service to do all this work and to do it sequentially. If this thread gets preempted for any reason (e.g., connector tasks are bogging down the process, DoS, etc.), then the herder's membership in the group may be dropped, causing a rebalance.
> This herder thread should be run at a much higher priority than all of the other threads in the system.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
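
For readers who want to see what the proposal in the quoted issue would look like mechanically, here is a minimal Java sketch of a single-threaded executor whose one worker thread is created at a raised priority. It is not Kafka Connect's code: the class name `HerderPrioritySketch`, the thread name `herder-sketch`, and the choice of `Thread.MAX_PRIORITY` are all assumptions made for illustration. Java thread priorities are only a hint to the JVM and operating system scheduler, so, as the comment above argues, this would not by itself fix a worker that is short on resources.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class HerderPrioritySketch {

    // Hypothetical factory: names the thread and requests the highest priority.
    // The priority is a scheduling hint, not a guarantee.
    static ThreadFactory highPriorityFactory(String name) {
        return runnable -> {
            Thread t = new Thread(runnable, name);
            t.setPriority(Thread.MAX_PRIORITY);
            return t;
        };
    }

    // A single-threaded executor, so submitted work runs sequentially
    // on the one prioritized thread.
    static ExecutorService newHerderExecutor() {
        return Executors.newSingleThreadExecutor(highPriorityFactory("herder-sketch"));
    }

    // Runs one task on the executor and reports the priority it ran at.
    static int observedPriority() {
        ExecutorService executor = newHerderExecutor();
        try {
            return executor.submit(() -> Thread.currentThread().getPriority()).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("herder-sketch priority: " + observedPriority());
    }
}
```

The sketch only covers the one executor thread. The heartbeat thread mentioned in the comment is separate and would be unaffected by this change.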