Subject: Re: Task Manager was lost/killed due to full GC
From: Greg Hogan
Date: Fri, 15 Sep 2017 12:50:54 -0400
To: ShB
Cc: user

Late response, but a common reason for disappearing TaskManagers is termination by the Linux out-of-memory killer. The usual recommendation in that case is to decrease the memory allotted to the TaskManager.

> On Sep 5, 2017, at 9:09 AM, ShB wrote:
>
> Hi,
>
> I'm running a Flink batch job that reads almost 1 TB of data from S3 and
> then performs operations on it. A list of filenames is distributed among
> the TMs, and each TM reads its subset of files from S3. This job
> errors out at the read step with the following error:
> java.lang.Exception: TaskManager was lost/killed
>
> Having read similar questions on the mailing list, it seems like this is a
> memory issue, with full GC at the TM causing the TM to be lost.
>
> After enabling memory debugging, these seem to be the stats just before
> erroring out:
> Memory usage stats: [HEAP: 8327/18704/18704 MB, NON HEAP: 79/81/-1 MB
> (used/committed/max)]
> Direct memory stats: Count: 5236, Total Capacity: 17148907, Used Memory:
> 17148908
> Off-heap pool stats: [Code Cache: 25/27/240 MB (used/committed/max)],
> [Metaspace: 47/48/-1 MB (used/committed/max)], [Compressed Class Space:
> 5/5/1024 MB (used/committed/max)]
> Garbage collector stats: [G1 Young Generation, GC TIME (ms): 16712, GC
> COUNT: 290], [G1 Old Generation, GC TIME (ms): 689, GC COUNT: 2]
>
> I tried all of these suggested fixes: decreased taskmanager.memory.fraction
> to give more memory to user managed operations, increased the number of
> JVMs (parallelism), and used the G1 GC for better GC performance, but my job
> still errors out.
>
> I increased akka.watch.heartbeat.pause, akka.watch.threshold, and
> akka.watch.heartbeat.interval to prevent the timeout due to GC, but this
> doesn't help either. I figured that with the really high values for death watch,
> the program would run really slowly and complete at some point, but it fails
> anyway.
>
> I'm now trying to decrease object creation in my program, but so far it
> hasn't helped.
>
> How can I go about debugging and fixing this problem?
>
> Thank you.
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
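
One way to check the out-of-memory-killer explanation given at the top of this reply is to look for kill records in the kernel log on the TaskManager host. The following is only a minimal sketch, not anything from Flink itself: the function name find_oom_kills is hypothetical, and it assumes the TaskManager shows up in the kernel log under the process name "java" and that the kernel writes its usual "Killed process <pid> (<name>)" line.

```python
import re
import sys

# Matches the kernel's kill record, e.g.
#   "Killed process 4242 (java) total-vm:...kB, anon-rss:...kB"
# and the newer "Out of memory: Killed process 4242 (java) ..." variant.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log, process_name="java"):
    """Return PIDs of processes named process_name that the OOM killer terminated.

    kernel_log is the text of the kernel log (for example, the output of dmesg).
    process_name defaults to "java", which is an assumption about how the
    TaskManager JVM appears in the log.
    """
    return [
        int(pid)
        for pid, name in OOM_KILL.findall(kernel_log)
        if name == process_name
    ]


if __name__ == "__main__":
    # Read kernel log text from standard input and print one PID per line.
    for pid in find_oom_kills(sys.stdin.read()):
        print(pid)
```

It could be run on each TaskManager host by piping the kernel log into it, for example `dmesg | python find_oom_kills.py` (the script filename is hypothetical). If a PID printed there matches a lost TaskManager, the kernel rather than the JVM's garbage collector ended the process.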