Date: Mon, 24 Aug 2020 08:05:00 +0000 (UTC)
From: "Till Rohrmann (Jira)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Commented] (FLINK-18959) Fail to archiveExecutionGraph because job is not finished when dispatcher close

    [ https://issues.apache.org/jira/browse/FLINK-18959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183037#comment-17183037 ]

Till Rohrmann commented on FLINK-18959:
---------------------------------------

[~ZhuShang] I believe that this is unrelated since the problem of this ticket should only affect the per-job cluster deployments (deployments which use the {{MiniDispatcher}}).

> Fail to archiveExecutionGraph because job is not finished when dispatcher close
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-18959
>                 URL: https://issues.apache.org/jira/browse/FLINK-18959
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.12.0, 1.11.1
>            Reporter: Liu
>            Priority: Critical
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>         Attachments: flink-debug-log
>
>
> When a job is cancelled, we expect to see it in Flink's history server, but I cannot see my job after it is cancelled.
> After digging into the problem, I found that the function archiveExecutionGraph is not executed. Below is a brief log:
> {panel:title=log}
> 2020-08-14 15:10:06,406 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-15] - Job EtlAndWindow (6f784d4cc5bae88a332d254b21660372) switched from state RUNNING to CANCELLING.
> 2020-08-14 15:10:06,415 DEBUG org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-3] - Shutting down per-job cluster because the job was canceled.
> 2020-08-14 15:10:06,629 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-3] - Stopping dispatcher akka.tcp://flink@bjfk-c9865.yz02:38663/user/dispatcher.
> 2020-08-14 15:10:06,629 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-3] - Stopping all currently running jobs of dispatcher akka.tcp://flink@bjfk-c9865.yz02:38663/user/dispatcher.
> 2020-08-14 15:10:06,631 INFO org.apache.flink.runtime.jobmaster.JobMaster [flink-akka.actor.default-dispatcher-29] - Stopping the JobMaster for job EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,632 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [flink-akka.actor.default-dispatcher-29] - Disconnect TaskExecutor container_e144_1590060720089_2161_01_000006 because: Stopping JobMaster for job EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,646 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-29] - Job EtlAndWindow (6f784d4cc5bae88a332d254b21660372) switched from state CANCELLING to CANCELED.
> 2020-08-14 15:10:06,664 DEBUG org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-4] - There is a newer JobManagerRunner for the job 6f784d4cc5bae88a332d254b21660372.
> {panel}
> From the log, we can see that the job is not finished when the dispatcher closes. The process is as follows:
> * The cancel command is received and sent to all tasks asynchronously.
> * In MiniDispatcher, the per-job cluster begins shutting down.
> * The dispatcher is stopped and the job is removed.
> * The job reaches the cancelled state and the callback in method startJobManagerRunner is executed.
> * Because the job was removed earlier, currentJobManagerRunner is null, which does not equal the original jobManagerRunner. In this case, the archivedExecutionGraph will not be uploaded.
> In normal cases, I find that the job is cancelled first and then the dispatcher is stopped, so the archivedExecutionGraph is uploaded successfully. But I think that the order is not constrained and it is hard to know which comes first.
> The above is what I suspect. If so, then we should fix it.
>
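
To make the ordering problem described in the quoted issue concrete, here is a minimal, single-threaded sketch of the suspected race. It is not Flink's implementation: `ToyDispatcher`, `Runner`, `startRunner`, `stop`, and `run` are hypothetical names invented for this illustration, and the only behaviour modelled is the identity check between the registered runner and the runner whose result future completed, as the reporter describes it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class ArchiveRaceSketch {

    /** Hypothetical stand-in for a job manager runner; not the Flink class. */
    static final class Runner {
        final CompletableFuture<String> result = new CompletableFuture<>();
    }

    /** Hypothetical, heavily simplified dispatcher; models only the identity check. */
    static final class ToyDispatcher {
        final Map<String, Runner> runners = new HashMap<>();
        boolean archived = false;

        void startRunner(String jobId) {
            Runner runner = new Runner();
            runners.put(jobId, runner);
            // Callback fires when the job reaches a terminal state.
            runner.result.whenComplete((state, error) -> {
                Runner current = runners.get(jobId);
                if (current == runner) {
                    // Runner is still registered: archive the finished job.
                    archived = true;
                    runners.remove(jobId);
                }
                // Otherwise the runner was already removed (current == null here),
                // so the archiving step is silently skipped.
            });
        }

        /** Stopping the dispatcher removes all registered jobs. */
        void stop() {
            runners.clear();
        }
    }

    /** Plays one ordering and reports whether the job was archived. */
    static boolean run(boolean stopDispatcherFirst) {
        ToyDispatcher dispatcher = new ToyDispatcher();
        dispatcher.startRunner("job");
        Runner runner = dispatcher.runners.get("job");
        if (stopDispatcherFirst) {
            dispatcher.stop();                  // job removed before it finished
            runner.result.complete("CANCELED"); // callback sees no registered runner
        } else {
            runner.result.complete("CANCELED"); // callback archives the job
            dispatcher.stop();
        }
        return dispatcher.archived;
    }

    public static void main(String[] args) {
        System.out.println("job finishes first:     archived = " + run(false));
        System.out.println("dispatcher stops first: archived = " + run(true));
    }
}
```

In this sketch the first ordering archives the job and the second does not, which matches the reporter's suspicion that the outcome depends on whether the job reaches CANCELED before or after the dispatcher removes it. A real fix would have to make the shutdown wait for the job's terminal state; how Flink does that is not shown here.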