From: Greg Hogan
To: dev@flink.apache.org
Date: Thu, 29 Oct 2015 13:44:26 -0400
Subject: Re: Diagnosing TaskManager disappearance

I removed the use of numactl but still start two TaskManagers per node, and
I am still seeing TaskManagers crash.

From the JobManager log:

17:36:06,412 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@10.0.88.140:45742] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
17:36:06,567 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (370/2322) (cac9927a8568c2ad79439262a91478af) switched from RUNNING to FAILED
17:36:06,572 INFO  org.apache.flink.runtime.jobmanager.JobManager - Status of job 14d946015fd7b35eb801ea6fee5af9e4 (Flink Java Job at Thu Oct 29 17:34:48 UTC 2015) changed to FAILING.
java.lang.Exception: The data preparation for task 'CHAIN GroupReduce (Compute scores) -> FlatMap (checksum())' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
    at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1089)
    at org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459)
    ... 3 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-0-88-140/10.0.88.140:58558'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
    at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:306)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
    at java.lang.Thread.run(Thread.java:745)
17:36:06,587 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN GroupReduce (Compute scores) -> FlatMap (checksum()) (367/2322) (d63c681a18b8164bc24936df1ecb159b) switched from RUNNING to FAILED

On Thu, Oct 29, 2015 at 1:00 PM, Stephan Ewen wrote:

> Hi Greg!
>
> Interesting... When you say the TaskManagers are dropping, are the
> TaskManager processes crashing, or are they losing connection to the
> JobManager?
>
> Greetings,
> Stephan
>
>
> On Thu, Oct 29, 2015 at 9:56 AM, Greg Hogan wrote:
>
> > I recently discovered that AWS uses NUMA for its largest nodes. An
> > example c4.8xlarge:
> >
> > $ numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 18 19 20 21 22 23 24 25 26
> > node 0 size: 29813 MB
> > node 0 free: 24537 MB
> > node 1 cpus: 9 10 11 12 13 14 15 16 17 27 28 29 30 31 32 33 34 35
> > node 1 size: 30574 MB
> > node 1 free: 22757 MB
> > node distances:
> > node   0   1
> >   0:  10  20
> >   1:  20  10
> >
> > I discovered yesterday that Flink performed ~20-30% faster on large
> > datasets by running two NUMA-constrained TaskManagers per node. The
> > JobManager node ran a single TaskManager. Resources were divided in half
> > relative to running a single TaskManager.
> >
> > The changes from the tail of /bin/taskmanager.sh:
> >
> > -"${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > +numactl --membind=0 --cpunodebind=0 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> > +numactl --membind=1 --cpunodebind=1 "${FLINK_BIN_DIR}"/flink-daemon.sh $STARTSTOP taskmanager "${args[@]}"
> >
> > After reverting this change the system is again stable. I had not
> > experienced issues using numactl when running 16 nodes.
> >
> > Greg
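
For triaging a JobManager log like the excerpt above, here is a minimal sketch that counts the Akka disassociation warnings per remote address and the number of subtasks that went from RUNNING to FAILED. The function names disassociated_hosts and failed_subtasks are hypothetical helpers, not part of Flink, and the patterns assume the log lines look exactly like the ones quoted in this message.

```shell
#!/usr/bin/env bash
# Sketch only: both functions read JobManager log text on stdin and match the
# line formats quoted in this thread; other Flink versions may log differently.

# Count "Association with remote system [...] has failed" warnings per address.
# Prints "<count> <akka address>" with the most frequent address first.
disassociated_hosts() {
    sed -n 's/.*Association with remote system \[\(akka\.tcp:[^]]*\)] has failed.*/\1/p' \
        | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Count subtasks reported as "switched from RUNNING to FAILED".
# grep exits non-zero when the count is 0; ignore that so callers using
# `set -e` are not aborted by an empty result.
failed_subtasks() {
    grep -c 'switched from RUNNING to FAILED' || true
}
```

Usage, assuming the log has been copied to a file named jobmanager.log (a placeholder name): `disassociated_hosts < jobmanager.log` and `failed_subtasks < jobmanager.log`. A disassociation warning by itself does not say whether the TaskManager process died or only lost its connection, so the TaskManager's own log and process state on the listed host still need to be checked.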