Date: Fri, 9 Feb 2018 17:00:01 +0000 (UTC)
From: "Randall Hauch (JIRA)"
To: jira@kafka.apache.org
Subject: [jira] [Comment Edited] (KAFKA-5896) Kafka Connect task threads never interrupted


    [ https://issues.apache.org/jira/browse/KAFKA-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358670#comment-16358670 ]

Randall Hauch edited comment on KAFKA-5896 at 2/9/18 4:59 PM:
--------------------------------------------------------------

[~ewencp], do you have any thoughts on this? I know in the past you've talked about not wanting to do this since some developers won't properly implement the interruption. I agree that not everyone implements it correctly, but we could at least _try_ to cancel the tasks. And this latest PR seems like a good approach.


was (Author: rhauch):
[~ewencp], do you have any thoughts on this? I know in the past you've talked about not wanting to do this since some developers won't properly implement the interruption. I agree that not everyone implements it correctly, but we could at least _try_ to cancel the tasks.

> Kafka Connect task threads never interrupted
> --------------------------------------------
>
>                 Key: KAFKA-5896
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5896
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>            Reporter: Nick Pillitteri
>            Assignee: Nick Pillitteri
>            Priority: Minor
>
> h2. Problem
> Kafka Connect tasks associated with connectors are run in their own threads. When tasks are stopped or restarted, a flag - {{stopping}} - is set to indicate that the task should stop processing records. However, if the thread the task is running in is blocked (waiting for a lock or performing I/O), it's possible the task will never stop.
> I've created a connector specifically to demonstrate this issue (along with some more detailed instructions for reproducing the issue): https://github.com/smarter-travel-media/hang-connector
> I believe this is an issue because it means that a single badly behaved connector (any connector that does I/O without timeouts) can cause the Kafka Connect worker to get into a state where the only solution is to restart the JVM.
> I think, but couldn't reproduce, that this is the cause of this problem on Stack Overflow: https://stackoverflow.com/questions/43802156/inconsistent-connector-state-connectexception-task-already-exists-in-this-work
> h2. Expected Result
> I would expect the Worker to eventually interrupt the thread that the task is running in. This is what I've seen done in various other libraries when a thread needs to be forcibly stopped.
> h2. Actual Result
> In actuality, the Worker sets a {{stopping}} flag and lets the thread run indefinitely. It uses a timeout while waiting for the task to stop, but after this timeout has expired it simply sets a {{cancelled}} flag. This means that every time a task is restarted, a new thread running the task will be created. Thus a task may end up with multiple instances all running in their own threads when there's only supposed to be a single thread.
> h2. Steps to Reproduce
> The problem can be replicated by using the connector available here: https://github.com/smarter-travel-media/hang-connector
> Apologies for how involved the steps are.
> I've created a patch that forcibly interrupts threads after they fail to shut down gracefully here: https://github.com/smarter-travel-media/kafka/commit/295c747a9fd82ee8b30556c89c31e0bfcce5a2c5
> I've confirmed that this fixes the issue. I can add some unit tests and submit a PR if people agree that this is a bug and interrupting threads is the right fix.
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
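
For readers of the archive, the behavior discussed in the issue (set a {{stopping}} flag, wait with a timeout, then interrupt the thread if it is still alive) can be sketched in plain Java. This is only an illustrative sketch, not the code from the linked patch or from Kafka Connect: the class InterruptOnTimeoutSketch, the BlockingTask stand-in, and the timeout values are all hypothetical, and Thread.sleep is used to simulate a blocking call that has no timeout of its own.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptOnTimeoutSketch {

    /** Hypothetical stand-in for a connector task; not a Kafka Connect class. */
    static final class BlockingTask implements Runnable {
        final AtomicBoolean stopping = new AtomicBoolean(false);
        final CountDownLatch started = new CountDownLatch(1);

        @Override
        public void run() {
            started.countDown();
            while (!stopping.get()) {
                try {
                    // Simulates a blocking call without a timeout. While the
                    // thread is parked here it cannot observe the stopping flag.
                    Thread.sleep(Long.MAX_VALUE);
                } catch (InterruptedException e) {
                    // Restore the interrupt status and exit the task loop.
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    /**
     * Asks the task to stop, waits up to gracefulMs, then interrupts the
     * thread if it is still alive. Both timeouts must be greater than zero,
     * because Thread.join(0) waits forever.
     * Returns true if the thread has terminated when this method returns.
     */
    static boolean stop(BlockingTask task, Thread thread, long gracefulMs, long afterInterruptMs)
            throws InterruptedException {
        task.stopping.set(true);          // cooperative request, as described in the issue
        thread.join(gracefulMs);          // bounded wait for a graceful exit
        if (thread.isAlive()) {
            thread.interrupt();           // the step the issue asks for
            thread.join(afterInterruptMs);
        }
        return !thread.isAlive();
    }

    /** Starts a blocked task and stops it; returns true if the thread ended. */
    static boolean runDemo() {
        BlockingTask task = new BlockingTask();
        Thread thread = new Thread(task, "hypothetical-task-thread");
        thread.setDaemon(true);
        thread.start();
        try {
            task.started.await();
            return stop(task, thread, 200, 2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("thread terminated: " + runDemo());
    }
}
```

As the comment above notes, an interrupt is only a best-effort attempt: it helps when the blocked call responds to interruption (as Thread.sleep does here), and a task that swallows InterruptedException or blocks in a non-interruptible call would still keep running.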