From: Maximilian Michels
Date: Thu, 29 Oct 2015 13:50:54 +0100
Subject: Re: Diagnosing TaskManager disappearance
To: dev@flink.apache.org
Reply-To: dev@flink.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11425a4e24fcf205233dc4a4

--001a11425a4e24fcf205233dc4a4
Content-Type: text/plain; charset=UTF-8

Hi Greg,

Thanks for reporting. You wrote you didn't see any output in the .out files of the task managers.
What about the .log files of these instances? Where and when did you produce the thread dump you included?

Thanks,
Max

On Thu, Oct 29, 2015 at 1:46 PM, Greg Hogan wrote:

> I am testing again on a 64 node cluster (the JobManager is running fine
> having reduced some operator's parallelism and fixed the string conversion
> performance).
>
> I am seeing TaskManagers drop like flies every other job or so. I am not
> seeing any output in the .out log files corresponding to the crashed
> TaskManagers.
>
> Below is the stack trace from a java.hprof heap dump.
>
> How should I be debugging this?
>
> Thanks,
> Greg
>
>
> Threads at the heap dump:
>
> Unknown thread
>
>
> "Memory Logger" daemon prio=1 tid=119 TIMED_WAITING
>     at java.lang.Thread.<init>(Thread.java:507)
>     at org.apache.flink.runtime.taskmanager.MemoryLogger.<init>(MemoryLogger.java:67)
>     at org.apache.flink.runtime.taskmanager.TaskManager$.runTaskManager(TaskManager.scala:1494)
>     at org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1330)
>
>
> "Flink Netty Server (59693) Thread 0" daemon prio=5 tid=193 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:674)
>     at java.util.concurrent.Executors$DefaultThreadFactory.newThread(Executors.java:613)
>     at org.apache.flink.shaded.com.google.common.util.concurrent.ThreadFactoryBuilder$1.newThread(ThreadFactoryBuilder.java:162)
>     at io.netty.util.concurrent.SingleThreadEventExecutor.<init>(SingleThreadEventExecutor.java:106)
>
>
> "flink-akka.remote.default-remote-dispatcher-6" daemon prio=5 tid=30 TIMED_WAITING
>     at java.lang.Thread.<init>(Thread.java:507)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
>     at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.<init>(ThreadPoolBuilder.scala:164)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:187)
>
>
> "flink-akka.actor.default-dispatcher-4" daemon prio=5 tid=28 WAITING
>     at java.lang.Thread.<init>(Thread.java:507)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
>     at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.<init>(ThreadPoolBuilder.scala:164)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:187)
>
>
> "flink-akka.remote.default-remote-dispatcher-5" daemon prio=5 tid=29 WAITING
>     at java.lang.Thread.<init>(Thread.java:507)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
>     at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.<init>(ThreadPoolBuilder.scala:164)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:187)
>
>
> "flink-akka.actor.default-dispatcher-2" daemon prio=5 tid=26 WAITING
>     at java.lang.Thread.<init>(Thread.java:507)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
>     at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.<init>(ThreadPoolBuilder.scala:164)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:187)
>
>
> "SIGTERM handler" daemon prio=9 tid=268 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:547)
>     at sun.misc.Signal.dispatch(Signal.java:216)
>
>
> "HPROF gc_finish watcher" daemon prio=10 tid=5 RUNNABLE
>
>
> "Reference Handler" daemon prio=10 tid=2 WAITING
>
>
> "main" prio=5 tid=1 WAITING
>
>
> "Signal Dispatcher" daemon prio=9 tid=4 RUNNABLE
>
>
> "Finalizer" daemon prio=8 tid=3 WAITING
>
>
> "flink-akka.actor.default-dispatcher-3" daemon prio=5 tid=27 TIMED_WAITING
>     at java.lang.Thread.<init>(Thread.java:507)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
>     at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.<init>(ThreadPoolBuilder.scala:164)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:187)
>
>
> "New I/O worker #1" daemon prio=5 tid=31 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>
>
> "flink-scheduler-1" daemon prio=5 tid=25 TIMED_WAITING
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at akka.actor.LightArrayRevolverScheduler.<init>(Scheduler.scala:337)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(NativeConstructorAccessorImpl.java)
>
>
> "New I/O worker #2" daemon prio=5 tid=32 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>
>
> "Hashed wheel timer #1" daemon prio=5 tid=33 TIMED_WAITING
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at org.jboss.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:226)
>         Local Variable: java.util.ArrayList#502
>     at org.jboss.netty.util.HashedWheelTimer.<init>(HashedWheelTimer.java:177)
>         Local Variable: java.lang.String#15234
>
>
> "New I/O boss #3" daemon prio=5 tid=34 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>
>
> "Timer-0" daemon prio=5 tid=267 TIMED_WAITING
>     at java.lang.Thread.<init>(Thread.java:444)
>     at java.util.TimerThread.<init>(Timer.java:499)
>     at java.util.Timer.<init>(Timer.java:101)
>     at java.util.Timer.<init>(Timer.java:146)
>
>
> "New I/O worker #4" daemon prio=5 tid=35 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>
>
> "New I/O worker #5" daemon prio=5 tid=36 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>
>
> "New I/O server boss #6" daemon prio=5 tid=37 RUNNABLE
>     at java.lang.Thread.<init>(Thread.java:547)
>     at akka.dispatch.MonitorableThreadFactory.newThread(ThreadPoolBuilder.scala:193)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
>

--001a11425a4e24fcf205233dc4a4--