Date: Tue, 9 Oct 2018 07:50:00 +0000 (UTC)
From: "Till Rohrmann (JIRA)" 
To: issues@flink.apache.org
Subject: [jira] [Commented] (FLINK-9788) ExecutionGraph Inconsistency prevents Job from recovering

    [ https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642919#comment-16642919 ]

Till Rohrmann commented on FLINK-9788:
--------------------------------------

Argh, this sounds quite bad. Thanks a lot for diagnosing the problem [~SleePy]. We should definitely fix this problem for 1.7. I'll mark it as a blocker.

I can think of two high-level solutions here:

1. Ignore failures that arrive while the job is in state RESTARTING, because they must originate from the previous run. For this we need to check whether {{failGlobal}} is really only called by a running {{ExecutionGraph}}.
2. Cancel the subsumed restarting operation, such that eventually the latest restarting operation will succeed.
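To make solution 1 concrete, here is a hypothetical, heavily simplified sketch of the idea: a {{failGlobal}} call that arrives while the graph is already RESTARTING is treated as a stale failure from the previous execution attempt and ignored. The class and method names only mirror Flink's {{ExecutionGraph}}; this is not the actual implementation.

```java
// Simplified model of the RESTARTING guard proposed in solution 1.
// Assumption: a failure reported while the job is RESTARTING can only
// come from the previous run, so it must not trigger a new restart cycle.

enum JobStatus { RUNNING, FAILING, RESTARTING }

class MiniExecutionGraph {
    private JobStatus state = JobStatus.RUNNING;

    JobStatus getState() { return state; }

    /** Enter the restarting phase, e.g. after a TaskManager loss. */
    void restart() { state = JobStatus.RESTARTING; }

    /**
     * Report a global failure. Returns true if the failure was acted on,
     * false if it was ignored as a leftover of the previous attempt.
     */
    boolean failGlobal(Throwable cause) {
        if (state == JobStatus.RESTARTING) {
            // Stale failure from the old execution attempt: ignore it so the
            // in-flight restart is not disturbed (solution 1).
            return false;
        }
        state = JobStatus.FAILING;
        return true;
    }
}
```

Whether this is safe hinges on the question raised above: {{failGlobal}} must never carry a genuinely new failure while the graph is RESTARTING.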
> ExecutionGraph Inconsistency prevents Job from recovering
> ---------------------------------------------------------
>
>                 Key: FLINK-9788
>                 URL: https://issues.apache.org/jira/browse/FLINK-9788
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.6.0
>         Environment: Rev: 4a06160
> Hadoop 2.8.3
>            Reporter: Gary Yao
>            Priority: Critical
>             Fix For: 1.7.0, 1.6.2
>
>         Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
>
> After killing many TaskManagers in succession, the ExecutionGraph ran into an inconsistent state, which prevented job recovery. The following stack trace was logged in the JobManager log several hundred times per second:
> {noformat}
> 2018-07-08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph - Resetting execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new execution.
> 2018-07-08 16:47:18,857 WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph - Failed to restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal state CREATED
> 	at org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
> 	at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
> 	at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting JobManager log file was 4.7 GB in size. Find attached the first 5000 lines of the log file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
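The exception in the quoted stack trace follows from an invariant that a vertex may only be reset for a new execution once its current execution has reached a terminal state. The following is a hypothetical, minimal sketch of that guard, illustrating how a second, subsumed restart attempt trips it: the first restart already put the vertex back into CREATED (non-terminal), so the second call to reset must fail. The class name and state set are simplified stand-ins, not Flink's actual code.

```java
// Simplified model of the terminal-state check behind
// "Cannot reset a vertex that is in non-terminal state CREATED".

enum ExecutionState { CREATED, RUNNING, CANCELED, FAILED, FINISHED }

class MiniExecutionVertex {
    private ExecutionState state = ExecutionState.RUNNING;

    ExecutionState getState() { return state; }

    /** Simulate the current execution failing (a terminal state). */
    void fail() { state = ExecutionState.FAILED; }

    /**
     * Reset the vertex for a fresh execution attempt. Only legal once the
     * current execution is terminal; a concurrent, subsumed restart that
     * already reset the vertex to CREATED makes a second reset throw.
     */
    void resetForNewExecution() {
        boolean terminal = state == ExecutionState.CANCELED
                || state == ExecutionState.FAILED
                || state == ExecutionState.FINISHED;
        if (!terminal) {
            throw new IllegalStateException(
                "Cannot reset a vertex that is in non-terminal state " + state);
        }
        state = ExecutionState.CREATED; // fresh execution attempt
    }
}
```

Under this model, two overlapping restart operations reproduce the logged failure loop: the second {{resetForNewExecution}} finds the vertex in CREATED and throws, which in turn triggers yet another restart attempt.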