Date: Mon, 1 Jun 2020 09:24:00 +0000 (UTC)
From: "Nicholas Jiang (Jira)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-17726) Scheduler should take care of tasks directly canceled by TaskManager

[ https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120883#comment-17120883 ]

Nicholas Jiang commented on FLINK-17726:
----------------------------------------

[~zhuzh] I have double-checked the methods that trigger the cancel task operation, including cancelOrFailAndCancelInvokable and cancelExecution, which are based on TaskCanceler; cancelInvokable, which is based on the invokable's cancel method; and the callers of transitionState. After checking again, the case in which a task is directly CANCELED when its upstream task was canceled/failed does not exist.

IMO, the solution would be to treat tasks that transition to CANCELED from any state other than CANCELING as FAILED, the same as the solution you mentioned. What do you think?

> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
>                 Key: FLINK-17726
>                 URL: https://issues.apache.org/jira/browse/FLINK-17726
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>   Affects Versions: 1.11.0, 1.12.0
>            Reporter: Zhu Zhu
>            Assignee: Nicholas Jiang
>            Priority: Critical
>             Fix For: 1.11.0, 1.12.0
>
>
> JobManager will not trigger failure handling when receiving a CANCELED task update.
> This is because CANCELED tasks are usually caused by another FAILED task. These CANCELED tasks will be restarted by the failover process triggered by the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to CANCELED from all states except CANCELING as failed tasks.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
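
The rule proposed above (treat a transition to CANCELED from any state other than CANCELING as a failure) can be sketched roughly as follows. This is only a minimal, self-contained illustration and not Flink's actual scheduler code: the class CanceledTaskHandlingSketch, the simplified TaskState enum, and the methods effectiveState and triggersFailureHandling are all hypothetical names made up for this example.

```java
// Hypothetical sketch; none of these names come from the Flink code base.
class CanceledTaskHandlingSketch {

    /** Simplified set of task states, assumed for illustration only. */
    enum TaskState { CREATED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

    /**
     * Maps a reported state transition to the state the JobManager would act on.
     * A CANCELED update that was not preceded by CANCELING is treated as FAILED,
     * because nobody asked for the cancellation and no failover is in progress.
     */
    static TaskState effectiveState(TaskState previousState, TaskState reportedState) {
        if (reportedState == TaskState.CANCELED && previousState != TaskState.CANCELING) {
            return TaskState.FAILED;
        }
        return reportedState;
    }

    /** Failure handling is triggered only when the effective state is FAILED. */
    static boolean triggersFailureHandling(TaskState previousState, TaskState reportedState) {
        return effectiveState(previousState, reportedState) == TaskState.FAILED;
    }

    public static void main(String[] args) {
        // Task canceled directly by the TaskManager while running: handled as a failure.
        System.out.println(effectiveState(TaskState.RUNNING, TaskState.CANCELED));   // FAILED
        // Task canceled as part of a requested cancellation: stays CANCELED.
        System.out.println(effectiveState(TaskState.CANCELING, TaskState.CANCELED)); // CANCELED
    }
}
```

With a mapping like this, a task that the TaskManager cancels on its own would enter the normal failover path instead of leaving the job hanging, while cancellations requested through CANCELING keep their current behavior.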