Return-Path: X-Original-To: apmail-flink-issues-archive@minotaur.apache.org Delivered-To: apmail-flink-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 78A7218677 for ; Wed, 10 Feb 2016 16:56:19 +0000 (UTC) Received: (qmail 78070 invoked by uid 500); 10 Feb 2016 16:56:19 -0000 Delivered-To: apmail-flink-issues-archive@flink.apache.org Received: (qmail 77757 invoked by uid 500); 10 Feb 2016 16:56:18 -0000 Mailing-List: contact issues-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@flink.apache.org Delivered-To: mailing list issues@flink.apache.org Received: (qmail 77685 invoked by uid 99); 10 Feb 2016 16:56:18 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 10 Feb 2016 16:56:18 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 770FA2C1F71 for ; Wed, 10 Feb 2016 16:56:18 +0000 (UTC) Date: Wed, 10 Feb 2016 16:56:18 +0000 (UTC) From: "Stephan Ewen (JIRA)" To: issues@flink.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Closed] (FLINK-3260) ExecutionGraph gets stuck in state FAILING MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/FLINK-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Ewen closed FLINK-3260. ------------------------------- > ExecutionGraph gets stuck in state FAILING > ------------------------------------------ > > Key: FLINK-3260 > URL: https://issues.apache.org/jira/browse/FLINK-3260 > Project: Flink > Issue Type: Bug > Components: JobManager > Affects Versions: 0.10.1 > Reporter: Stephan Ewen > Assignee: Till Rohrmann > Priority: Blocker > Fix For: 1.0.0 > > > It is a bit of a rare case, but the following can currently happen: > 1. Jobs runs for a while, some tasks are already finished. > 2. Job fails, goes to state failing and restarting. Non-finished tasks fail or are canceled. > 3. For the finished tasks, ask-futures from certain messages (for example for releasing intermediate result partitions) can fail (timeout) and cause the execution to go from FINISHED to FAILED > 4. This triggers the execution graph to go to FAILING without ever going further into RESTARTING again > 5. The job is stuck > It initially looks like this is mainly an issue for batch jobs (jobs where tasks do finish, rather than run infinitely). > The log that shows how this manifests: > {code} > -------------------------------------------------------------------------------- > 17:19:19,782 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started > 17:19:19,844 INFO Remoting - Starting remoting > 17:19:20,065 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@127.0.0.1:56722] > 17:19:20,090 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-6766f51a-1c51-4a03-acfb-08c2c29c11f0 > 17:19:20,096 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:43327 - max concurrent requests: 50 - max backlog: 1000 > 17:19:20,113 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive > 17:19:20,115 INFO org.apache.flink.runtime.checkpoint.SavepointStoreFactory - No savepoint state backend configured. Using job manager savepoint state backend. > 17:19:20,118 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:20,123 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager was granted leadership with leader session ID None. > 17:19:25,605 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at testing-worker-linux-docker-e6d6931f-3200-linux-4 (akka.tcp://flink@172.17.0.253:43702/user/taskmanager) as f213232054587f296a12140d56f63ed1. Current number of registered hosts is 1. Current number of alive task slots is 2. > 17:19:26,758 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at testing-worker-linux-docker-e6d6931f-3200-linux-4 (akka.tcp://flink@172.17.0.253:43956/user/taskmanager) as f9e78baa14fb38c69517fb1bcf4f419c. Current number of registered hosts is 2. Current number of alive task slots is 4. > 17:19:27,064 INFO org.apache.flink.api.java.ExecutionEnvironment - The job has 0 registered types and 0 default Kryo serializers > 17:19:27,071 INFO org.apache.flink.client.program.Client - Starting client actor system > 17:19:27,072 INFO org.apache.flink.runtime.client.JobClient - Starting JobClient actor system > 17:19:27,110 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started > 17:19:27,121 INFO Remoting - Starting remoting > 17:19:27,143 INFO org.apache.flink.runtime.client.JobClient - Started JobClient actor system at 127.0.0.1:51198 > 17:19:27,145 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink@127.0.0.1:51198] > 17:19:27,325 INFO org.apache.flink.runtime.client.JobClientActor - Disconnect from JobManager null. > 17:19:27,362 INFO org.apache.flink.runtime.client.JobClientActor - Received job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 (fa05fd25993a8742da09cc5023c1e38d). > 17:19:27,362 INFO org.apache.flink.runtime.client.JobClientActor - Could not submit job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 (fa05fd25993a8742da09cc5023c1e38d), because there is no connection to a JobManager. > 17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor - Connect to JobManager Actor[akka.tcp://flink@127.0.0.1:56722/user/jobmanager#-1489998809]. > 17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor - Connected to new JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,379 INFO org.apache.flink.runtime.client.JobClientActor - Sending message to JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager to submit job Flink Java Job at Mon Jan 18 17:19:27 UTC 2016 (fa05fd25993a8742da09cc5023c1e38d) and wait for progress > 17:19:27,380 INFO org.apache.flink.runtime.client.JobClientActor - Upload jar files to job manager akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,380 INFO org.apache.flink.runtime.client.JobClientActor - Submit job to the job manager akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,453 INFO org.apache.flink.runtime.jobmanager.JobManager - Submitting job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016). > 17:19:27,591 INFO org.apache.flink.runtime.jobmanager.JobManager - Scheduling job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016). > 17:19:27,592 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from CREATED to SCHEDULED > 17:19:27,596 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from SCHEDULED to DEPLOYING > 17:19:27,597 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,606 INFO org.apache.flink.runtime.client.JobClientActor - Job was successfully submitted to the JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:19:27,630 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to RUNNING. > 17:19:27,637 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from CREATED to SCHEDULED > 17:19:27,654 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 Job execution switched to status RUNNING. > 17:19:27,655 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to SCHEDULED > 17:19:27,656 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to DEPLOYING > 17:19:27,666 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from SCHEDULED to DEPLOYING > 17:19:27,667 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,667 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to SCHEDULED > 17:19:27,669 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to DEPLOYING > 17:19:27,681 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from CREATED to SCHEDULED > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from SCHEDULED to DEPLOYING > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from CREATED to SCHEDULED > 17:19:27,682 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from SCHEDULED to DEPLOYING > 17:19:27,685 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:27,686 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to SCHEDULED > 17:19:27,687 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to DEPLOYING > 17:19:27,687 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to SCHEDULED > 17:19:27,692 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to DEPLOYING > 17:19:27,833 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from DEPLOYING to RUNNING > 17:19:27,839 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to RUNNING > 17:19:27,840 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from DEPLOYING to RUNNING > 17:19:27,852 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to RUNNING > 17:19:27,896 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from DEPLOYING to RUNNING > 17:19:27,898 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from DEPLOYING to RUNNING > 17:19:27,901 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to RUNNING > 17:19:27,905 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:27 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to RUNNING > 17:19:28,114 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from CREATED to SCHEDULED > 17:19:28,126 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from CREATED to SCHEDULED > 17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from SCHEDULED to DEPLOYING > 17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,126 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from CREATED to SCHEDULED > 17:19:28,139 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from SCHEDULED to DEPLOYING > 17:19:28,139 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,117 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from CREATED to SCHEDULED > 17:19:28,134 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from SCHEDULED to DEPLOYING > 17:19:28,140 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,140 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from SCHEDULED to DEPLOYING > 17:19:28,141 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:28,147 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to SCHEDULED > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to SCHEDULED > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to DEPLOYING > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to SCHEDULED > 17:19:28,153 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to DEPLOYING > 17:19:28,156 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to DEPLOYING > 17:19:28,158 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to SCHEDULED > 17:19:28,165 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to DEPLOYING > 17:19:28,238 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (2/4) (e73af91028cb76f7d3cd887cb6d66755) switched from RUNNING to FINISHED > 17:19:28,242 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(2/4) switched to FINISHED > 17:19:28,308 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (3/4) (807daf978da9dc347dca930822c78f8f) switched from RUNNING to FINISHED > 17:19:28,315 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (1/4) (c79bf4381462c690f5999f2d1949ab50) switched from RUNNING to FINISHED > 17:19:28,317 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(3/4) switched to FINISHED > 17:19:28,318 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(1/4) switched to FINISHED > 17:19:28,328 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from DEPLOYING to RUNNING > 17:19:28,336 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to RUNNING > 17:19:28,338 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from DEPLOYING to RUNNING > 17:19:28,341 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to RUNNING > 17:19:28,459 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat)) (4/4) (ba45c37065b67fc8f5005a50d0e88fff) switched from RUNNING to FINISHED > 17:19:28,463 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 DataSource (at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73) (org.apache.flink.api.java.io.ParallelIteratorInputFormat))(4/4) switched to FINISHED > 17:19:28,520 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from DEPLOYING to RUNNING > 17:19:28,529 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to RUNNING > 17:19:28,540 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from DEPLOYING to RUNNING > 17:19:28,545 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:28 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to RUNNING > 17:19:32,384 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at testing-worker-linux-docker-e6d6931f-3200-linux-4 (akka.tcp://flink@172.17.0.253:60852/user/taskmanager) as 5848d44035a164a0302da6c8701ff748. Current number of registered hosts is 3. Current number of alive task slots is 6. > 17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from CREATED to SCHEDULED > 17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from SCHEDULED to DEPLOYING > 17:19:32,598 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (attempt #0) to testing-worker-linux-docker-e6d6931f-3200-linux-4 > 17:19:32,605 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to SCHEDULED > 17:19:32,605 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to DEPLOYING > 17:19:32,611 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (4/4) (c928d19f73d700e80cdfad650689febb) switched from RUNNING to FINISHED > 17:19:32,614 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(4/4) switched to FINISHED > 17:19:32,717 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/4) (6421c8f88b191ea844619a40a523773b) switched from RUNNING to FINISHED > 17:19:32,719 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/4) switched to FINISHED > 17:19:32,724 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from DEPLOYING to RUNNING > 17:19:32,726 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to RUNNING > 17:19:32,843 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from RUNNING to FINISHED > 17:19:32,845 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:32 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to FINISHED > 17:19:33,092 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@172.17.0.253:43702] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. > 17:19:39,111 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702 > 17:19:39,113 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@172.17.0.253:43702/user/taskmanager terminated. > 17:19:39,114 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (3/4) (7997918330ecf2610b3298a8c8ef2852) switched from RUNNING to FAILED > 17:19:39,120 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(3/4) switched to FAILED > java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,129 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from RUNNING to CANCELING > 17:19:39,132 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSink (collect()) (1/1) (895e1ea552281a665ae390c966cdb3b7) switched from CREATED to CANCELED > 17:19:39,149 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Job execution switched to status FAILING. > java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,173 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to CANCELING > 17:19:39,173 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 DataSink (collect())(1/1) switched to CANCELED > 17:19:39,174 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (1/1) (d0f8f69f9047c3154b860850955de20f) switched from CANCELING to FAILED > 17:19:39,177 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Reduce (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(1/1) switched to FAILED > java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,179 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:39 Job execution switched to status RESTARTING. > 17:19:39,179 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Delaying retry of job execution for 10000 ms ... > 17:19:39,179 INFO org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager akka.tcp://flink@172.17.0.253:43702/user/taskmanager. Number of registered task managers 2. Number of available slots 4. > 17:19:39,179 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to FAILING. > java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager > at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) > at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) > at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) > at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:156) > at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:215) > at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:792) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33) > at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:100) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) > at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) > at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) > at akka.actor.ActorCell.invoke(ActorCell.scala:486) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 17:19:39,180 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to RESTARTING. > 17:19:42,766 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) (2/4) (d0d011dc0a0823bcec5a57a369b334ed) switched from FINISHED to FAILED > 17:19:42,773 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:42 CHAIN Partition -> Map (Map at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73)) -> Combine (Reduce at testTaskManagerFailure(TaskManagerProcessFailureBatchRecoveryITCase.java:73))(2/4) switched to FAILED > java.lang.IllegalStateException: Update task on instance f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to: > at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915) > at akka.dispatch.OnFailure.internal(Future.scala:228) > at akka.dispatch.OnFailure.internal(Future.scala:227) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] after [10000 ms] > at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > 17:19:42,774 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job fa05fd25993a8742da09cc5023c1e38d (Flink Java Job at Mon Jan 18 17:19:27 UTC 2016) changed to FAILING. > java.lang.IllegalStateException: Update task on instance f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to: > at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915) > at akka.dispatch.OnFailure.internal(Future.scala:228) > at akka.dispatch.OnFailure.internal(Future.scala:227) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] after [10000 ms] > at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > 17:19:42,780 INFO org.apache.flink.runtime.client.JobClientActor - 01/18/2016 17:19:42 Job execution switched to status FAILING. > java.lang.IllegalStateException: Update task on instance f213232054587f296a12140d56f63ed1 @ testing-worker-linux-docker-e6d6931f-3200-linux-4 - 2 slots - URL: akka.tcp://flink@172.17.0.253:43702/user/taskmanager failed due to: > at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:915) > at akka.dispatch.OnFailure.internal(Future.scala:228) > at akka.dispatch.OnFailure.internal(Future.scala:227) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174) > at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136) > at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@172.17.0.253:43702/user/taskmanager#-1712955384]] after [10000 ms] > at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > 17:19:49,152 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702 > 17:19:59,172 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702 > 17:20:09,191 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@172.17.0.253:43702]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.17.0.253:43702 > 17:24:32,423 INFO org.apache.flink.runtime.jobmanager.JobManager - Stopping JobManager akka.tcp://flink@127.0.0.1:56722/user/jobmanager. > 17:24:32,440 ERROR org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase - > -------------------------------------------------------------------------------- > Test testTaskManagerProcessFailure[0](org.apache.flink.test.recovery.TaskManagerProcessFailureBatchRecoveryITCase) failed with: > java.lang.AssertionError: The program did not finish in time > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at org.junit.Assert.assertFalse(Assert.java:64) > at org.apache.flink.test.recovery.AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure(AbstractTaskManagerProcessFailureRecoveryTest.java:212) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) > at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) > at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) > at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at org.junit.runners.Suite.runChild(Suite.java:127) > at org.junit.runners.Suite.runChild(Suite.java:26) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) > at org.junit.runners.ParentRunner.run(ParentRunner.java:309) > at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283) > at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173) > at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) > at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)