From: Robert Metzger
Date: Thu, 29 Oct 2015 22:43:58 -0400
Subject: Re: Diagnosing TaskManager disappearance
To: "dev@flink.apache.org"

So is the TaskManager JVM still running after the JM detected that the TM has gone? If not, can you check the kernel log (dmesg) to see whether Linux OOM killer stopped the process?
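In case it helps with the dmesg check, here is a minimal sketch of how the kernel log could be scanned for OOM-killer activity. The function names `find_oom_events` and `read_dmesg` are made up for illustration, and the matched phrases ("Out of memory", "oom-killer", "Killed process") are an assumption about what the kernel prints; the exact wording differs between kernel versions.

```python
import re
import subprocess
from typing import List

# Phrases the Linux kernel has used when the OOM killer acts.
# Assumption: wording varies by kernel version, so this is best effort.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)


def find_oom_events(kernel_log: str, process_name: str = "java") -> List[str]:
    """Return kernel log lines that look like OOM-killer activity
    and mention the given process name."""
    needle = process_name.lower()
    hits = []
    for line in kernel_log.splitlines():
        if OOM_PATTERN.search(line) and needle in line.lower():
            hits.append(line)
    return hits


def read_dmesg() -> str:
    """Read the kernel ring buffer via the dmesg command.
    This may need elevated privileges on some systems."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout
```

On the TaskManager host, `find_oom_events(read_dmesg())` would list candidate lines. An empty result does not rule out a kill, since the ring buffer may already have rotated.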
(if it's a kill, the JVM might not be able to log anything anymore)

On Thu, Oct 29, 2015 at 9:27 PM, Stephan Ewen wrote:

> Thanks for sharing the logs, Greg!
>
> Okay, so the TaskManager does not crash, but the Remote Failure Detector of
> Akka marks the connection between JobManager and TaskManager as broken.
>
> The TaskManager is not doing much GC, so it is not a long JVM freeze that
> causes heartbeats to time out...
>
> I am wondering at this point whether this is an issue in Akka, specifically
> the remote death watch that we use to let the JobManager recognize
> disconnected TaskManagers.
>
> One thing you could try is actually to comment out the line where the
> JobManager starts the death watch for the TaskManager and see if they can
> still successfully exchange messages (tasks finished, find inputs,
> schedule) and the program completes. That would indicate that the Akka
> Death Watch is flawed and that we should probably do our own heartbeats
> instead.
>
> Greetings,
> Stephan
>
>
> On Thu, Oct 29, 2015 at 11:44 AM, Aljoscha Krettek wrote:
>
> > Could it be a problem that there are two TaskManagers running per machine?
> >
> > > On 29 Oct 2015, at 19:04, Greg Hogan wrote:
> > >
> > > I have memory logging enabled.
> > > Tail of TaskManager log on 10.0.88.140:
> > >
> > > 17:35:26,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:27,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 576/1917/1917 MB, NON HEAP: 56/58/-1 MB (used/committed/max)]
> > > 17:35:27,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:27,415 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:28,012 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322)
> > > 17:35:28,015 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322)
> > > 17:35:28,016 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322) [DEPLOYING]
> > > 17:35:28,065 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (938/2322) switched to RUNNING
> > > 17:35:28,100 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322)
> > > 17:35:28,116 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322)
> > > 17:35:28,116 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322) [DEPLOYING]
> > > 17:35:28,132 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2304/2322) switched to RUNNING
> > > 17:35:28,255 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322)
> > > 17:35:28,263 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322)
> > > 17:35:28,263 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322) [DEPLOYING]
> > > 17:35:28,304 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322)
> > > 17:35:28,311 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322)
> > > 17:35:28,311 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322) [DEPLOYING]
> > > 17:35:28,323 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (939/2322) switched to RUNNING
> > > 17:35:28,386 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2062/2322) switched to RUNNING
> > > 17:35:28,396 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322)
> > > 17:35:28,401 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322)
> > > 17:35:28,402 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322) [DEPLOYING]
> > > 17:35:28,416 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 747/1917/1917 MB, NON HEAP: 56/58/-1 MB (used/committed/max)]
> > > 17:35:28,416 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:28,416 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 341, GC COUNT: 3], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:28,419 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1775/2322) switched to RUNNING
> > > 17:35:28,475 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322)
> > > 17:35:28,475 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322)
> > > 17:35:28,476 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322) [DEPLOYING]
> > > 17:35:28,509 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322)
> > > 17:35:28,860 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322)
> > > 17:35:28,861 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322) [DEPLOYING]
> > > 17:35:28,862 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2158/2322) switched to RUNNING
> > > 17:35:28,878 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1463/2322) switched to RUNNING
> > > 17:35:28,892 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322)
> > > 17:35:28,893 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322)
> > > 17:35:28,893 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322) [DEPLOYING]
> > > 17:35:28,914 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1154/2322) switched to RUNNING
> > > 17:35:28,916 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322)
> > > 17:35:28,917 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322)
> > > 17:35:28,917 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322) [DEPLOYING]
> > > 17:35:28,942 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322)
> > > 17:35:28,942 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322)
> > > 17:35:28,942 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322) [DEPLOYING]
> > > 17:35:28,943 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1429/2322) switched to RUNNING
> > > 17:35:28,955 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1078/2322) switched to RUNNING
> > > 17:35:28,959 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322)
> > > 17:35:28,995 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322)
> > > 17:35:28,995 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322) [DEPLOYING]
> > > 17:35:29,000 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322)
> > > 17:35:29,000 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322)
> > > 17:35:29,000 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322) [DEPLOYING]
> > > 17:35:29,012 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (524/2322) switched to RUNNING
> > > 17:35:29,039 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322)
> > > 17:35:29,039 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2021/2322) switched to RUNNING
> > > 17:35:29,043 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322)
> > > 17:35:29,043 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322) [DEPLOYING]
> > > 17:35:29,076 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322)
> > > 17:35:29,081 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322)
> > > 17:35:29,081 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322) [DEPLOYING]
> > > 17:35:29,095 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2022/2322) switched to RUNNING
> > > 17:35:29,108 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322)
> > > 17:35:29,110 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1464/2322) switched to RUNNING
> > > 17:35:29,112 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322)
> > > 17:35:29,112 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322) [DEPLOYING]
> > > 17:35:29,140 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322)
> > > 17:35:29,142 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322)
> > > 17:35:29,142 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322) [DEPLOYING]
> > > 17:35:29,147 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (1095/2322) switched to RUNNING
> > > 17:35:29,152 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322)
> > > 17:35:29,153 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2306/2322) switched to RUNNING
> > > 17:35:29,155 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322)
> > > 17:35:29,155 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322) [DEPLOYING]
> > > 17:35:29,166 INFO org.apache.flink.runtime.taskmanager.TaskManager - Received task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322)
> > > 17:35:29,167 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322)
> > > 17:35:29,167 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322) [DEPLOYING]
> > > 17:35:29,176 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (974/2322) switched to RUNNING
> > > 17:35:29,205 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (2305/2322) switched to RUNNING
> > > 17:35:29,417 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 590/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > 17:35:29,417 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:29,417 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:30,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 614/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > 17:35:30,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 18/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:30,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:31,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 634/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > 17:35:31,418 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:31,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:32,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 638/1917/1917 MB, NON HEAP: 57/59/-1 MB (used/committed/max)]
> > > 17:35:32,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:32,419 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:33,487 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 648/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:33,494 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:33,522 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:34,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 662/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:34,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:34,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:35,523 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 670/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:35,524 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:35,524 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:36,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 717/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:36,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:36,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:37,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 737/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:37,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/19/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:37,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:38,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 747/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:38,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:38,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:39,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 817/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:39,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:39,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:40,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 832/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:40,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:40,526 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:41,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 840/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:41,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:41,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:42,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 847/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:42,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:42,527 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 689, GC COUNT: 4], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:43,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 450/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:43,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:43,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:44,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 508/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:44,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:44,599 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:45,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 517/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:45,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:45,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:46,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 528/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:46,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:46,600 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:47,663 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 541/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:47,664 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:47,664 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:48,791 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 554/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:48,791 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:48,791 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:49,794 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 562/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:49,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:49,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:50,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 569/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:50,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:50,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:51,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 582/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:51,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:51,795 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:52,796 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 593/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:52,796 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:52,796 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:53,796 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 600/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:53,796 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:53,796 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:54,797 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 604/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:54,797 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:54,797 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:55,797 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 610/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:55,797 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:55,797 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS MarkSweep, GC TIME (ms): 974, GC COUNT: 1]
> > > 17:35:56,797 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 615/1917/1917 MB, NON HEAP: 58/59/-1 MB (used/committed/max)]
> > > 17:35:56,798 INFO org.apache.flink.runtime.taskmanager.TaskManager - Off-heap pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], [Metaspace: 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB (used/committed/max)]
> > > 17:35:56,798 INFO org.apache.flink.runtime.taskmanager.TaskManager - Garbage
collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > 17:35:57,798 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Memory > > > usage stats: [HEAP: 624/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > (used/committed/max)] > > > 17:35:57,798 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - > Off-heap > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > [Metaspace: > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB > > > (used/committed/max)] > > > 17:35:57,798 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Garbage > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > 17:35:58,798 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Memory > > > usage stats: [HEAP: 636/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > (used/committed/max)] > > > 17:35:58,798 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - > Off-heap > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > [Metaspace: > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB > > > (used/committed/max)] > > > 17:35:58,798 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Garbage > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > 17:35:59,799 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Memory > > > usage stats: [HEAP: 641/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > (used/committed/max)] > > > 17:35:59,799 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - > Off-heap > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > [Metaspace: > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB > > > (used/committed/max)] > > > 17:35:59,799 INFO > > > 
org.apache.flink.runtime.taskmanager.TaskManager - Garbage > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > 17:36:00,799 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Memory > > > usage stats: [HEAP: 648/1917/1917 MB, NON HEAP: 58/59/-1 MB > > > (used/committed/max)] > > > 17:36:00,799 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - > Off-heap > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > [Metaspace: > > > 34/34/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB > > > (used/committed/max)] > > > 17:36:00,799 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Garbage > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > 17:36:01,821 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Memory > > > usage stats: [HEAP: 655/1917/1917 MB, NON HEAP: 58/60/-1 MB > > > (used/committed/max)] > > > 17:36:01,936 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - > Off-heap > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > [Metaspace: > > > 34/35/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB > > > (used/committed/max)] > > > 17:36:01,936 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Garbage > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > 17:36:02,937 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Memory > > > usage stats: [HEAP: 665/1917/1917 MB, NON HEAP: 58/60/-1 MB > > > (used/committed/max)] > > > 17:36:02,937 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - > Off-heap > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > [Metaspace: > > > 34/35/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB > > > (used/committed/max)] > > > 
17:36:02,937 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Garbage > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > 17:36:03,944 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Memory > > > usage stats: [HEAP: 666/1917/1917 MB, NON HEAP: 58/60/-1 MB > > > (used/committed/max)] > > > 17:36:03,950 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - > Off-heap > > > pool stats: [Code Cache: 19/20/240 MB (used/committed/max)], > [Metaspace: > > > 34/35/-1 MB (used/committed/max)], [Compressed Class Space: 4/4/1024 MB > > > (used/committed/max)] > > > 17:36:03,951 INFO > > > org.apache.flink.runtime.taskmanager.TaskManager - Garbage > > > collector stats: [PS Scavenge, GC TIME (ms): 797, GC COUNT: 5], [PS > > > MarkSweep, GC TIME (ms): 974, GC COUNT: 1] > > > > > > On Thu, Oct 29, 2015 at 1:55 PM, Till Rohrmann > > wrote: > > > > > >> What does the log of the failed TaskManager 10.0.88.140 say? > > >> > > >> On Thu, Oct 29, 2015 at 6:44 PM, Greg Hogan > wrote: > > >> > > >>> I removed the use of numactl but left in starting two TaskManagers > and > > am > > >>> still seeing TaskManagers crash. > > >>> From the JobManager log: > > >>> > > >>> 17:36:06,412 WARN > > >>> akka.remote.ReliableDeliverySupervisor - > > >> Association > > >>> with remote system [akka.tcp://flink@10.0.88.140:45742] has failed, > > >>> address > > >>> is now gated for [5000] ms. Reason is: [Disassociated]. > > >>> 17:36:06,567 INFO > > >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN > > >>> GroupReduce (Compute scores) -> FlatMap (checksum()) (370/2322) > > >>> (cac9927a8568c2ad79439262a91478af) switched from RUNNING to FAILED > > >>> 17:36:06,572 INFO > > >>> org.apache.flink.runtime.jobmanager.JobManager - > Status > > of > > >>> job 14d946015fd7b35eb801ea6fee5af9e4 (Flink Java Job at Thu Oct 29 > > >> 17:34:48 > > >>> UTC 2015) changed to FAILING. 
> > >>> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce > > >>> (Compute scores) -> FlatMap (checksum())' , caused an error: Error > > >>> obtaining the sorted input: Thread 'SortMerger Reading Thread' > > terminated > > >>> due to an exception: Connection unexpectedly closed by remote task > > >> manager > > >>> 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the > > remote > > >>> task manager was lost. > > >>> at > > >>> org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465) > > >>> at > > >>> > org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354) > > >>> at > org.apache.flink.runtime.taskmanager.Task.run(Task.java:584) > > >>> at java.lang.Thread.run(Thread.java:745) > > >>> Caused by: java.lang.RuntimeException: Error obtaining the sorted > > input: > > >>> Thread 'SortMerger Reading Thread' terminated due to an exception: > > >>> Connection unexpectedly closed by remote task manager > 'ip-10-0-88-140/ > > >>> 10.0.88.140:58558'. This might indicate that the remote task manager > > was > > >>> lost. > > >>> at > > >>> > > >>> > > >> > > > org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619) > > >>> at > > >>> > > >> > > > org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1089) > > >>> at > > >>> > > >>> > > >> > > > org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94) > > >>> at > > >>> org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459) > > >>> ... 3 more > > >>> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' > > >>> terminated due to an exception: Connection unexpectedly closed by > > remote > > >>> task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate > > >> that > > >>> the remote task manager was lost. 
> > >>> at > > >>> > > >>> > > >> > > > org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800) > > >>> Caused by: > > >>> > > >>> > > >> > > > org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: > > >>> Connection unexpectedly closed by remote task manager > 'ip-10-0-88-140/ > > >>> 10.0.88.140:58558'. This might indicate that the remote task manager > > was > > >>> lost. > > >>> at > > >>> > > >>> > > >> > > > org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194) > > >>> at > > >>> > > >>> > > >> > > > io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194) > > >>> at > > >>> > > >>> > > >> > > > io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828) > > >>> at > > >>> > > >>> > > >> > > > 
io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621) > > >>> at > > >>> > > >>> > > >> > > > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358) > > >>> at > io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > > >>> at > > >>> > > >>> > > >> > > > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) > > >>> at java.lang.Thread.run(Thread.java:745) > > >>> 17:36:06,587 INFO > > >>> org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN > > >>> GroupReduce (Compute scores) -> FlatMap (checksum()) (367/2322) > > >>> (d63c681a18b8164bc24936df1ecb159b) switched from RUNNING to FAILED > > >>> > > >>> > > >>> On Thu, Oct 29, 2015 at 1:00 PM, Stephan Ewen > > wrote: > > >>> > > >>>> Hi Greg! > > >>>> > > >>>> Interesting... When you say the TaskManagers are dropping, are the > > >>>> TaskManager processes crashing, or are they losing connection to > the > > >>>> JobManager? > > >>>> > > >>>> Greetings, > > >>>> Stephan > > >>>> > > >>>> > > >>>> On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan > > >> wrote: > > >>>> > > >>>>> I recently discovered that AWS uses NUMA for its largest nodes. An > > >>>> example > > >>>>> c4.8xlarge: > > >>>>> > > >>>>> $ numactl --hardware > > >>>>> available: 2 nodes (0-1) > > >>>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26 > > >>>>> node 0 size: 29813 MB > > >>>>> node 0 free: 24537 MB > > >>>>> node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35 > > >>>>> node 1 size: 30574 MB > > >>>>> node 1 free: 22757 MB > > >>>>> node distances: > > >>>>> node 0 1 > > >>>>> 0: 10 20 > > >>>>> 1: 20 10 > > >>>>> > > >>>>> I discovered yesterday that Flink performed ~20-30% faster on large > > >>>>> datasets by running two NUMA-constrained TaskManagers per node. The > > >>>>> JobManager node ran a single TaskManager. 
Resources were divided in > > >>> half > > >>>>> relative to running a single TaskManager. > > >>>>> > > >>>>> The changes from the tail of /bin/taskmanager.sh: > > >>>>> > > >>>>> -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager > > >> "${args[@]}" > > >>>>> +numactl --membind=0 --cpunodebind=0 > > >> "${FLINK_BIN_DIR}"/flink-daemon.sh > > >>>>> $STARTSTOP taskmanager "${args[@]}" > > >>>>> +numactl --membind=1 --cpunodebind=1 > > >> "${FLINK_BIN_DIR}"/flink-daemon.sh > > >>>>> $STARTSTOP taskmanager "${args[@]}" > > >>>>> > > >>>>> After reverting this change the system is again stable. I had not > > >>>>> experienced issues using numactl when running 16 nodes. > > >>>>> > > >>>>> Greg > > >>>>> > > >>>> > > >>> > > >> > > > > > --001a1142ef9069cb370523496745--