Subject: Re: Flink memory leak
From: ebru <b20926247@cs.hacettepe.edu.tr>
Date: Tue, 7 Nov 2017 16:35:51 +0300
To: Ufuk Celebi <uce@apache.org>
Cc: user@flink.apache.org, Aljoscha Krettek

Hi Ufuk,

We don't explicitly define any state descriptor. We only use map and filter operators. We thought that GC handles clearing Flink's internal state.
So how can we manage the memory if it is always increasing?

- Ebru
On 7 Nov 2017, at 16:23, Ufuk Celebi <uce@apache.org> wrote:

Hey Ebru, the memory usage might be increasing as long as a job is running. This is expected (also in the case of multiple running jobs). The screenshots are not helpful in that regard. :-(

What kind of stateful operations are you using? Depending on your use case, you have to manually call `clear()` on the state instance in order to release the managed state.
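For illustration, a minimal sketch of such a manual clear() with a keyed ListState in the Flink 1.3 Java API (the class, field and flush-marker names here are made up, not taken from the job in this thread):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical keyed function: buffers events per key and explicitly
// releases the managed list state once the buffer has been emitted.
public class BufferAndClear extends RichFlatMapFunction<String, String> {

    private transient ListState<String> buffer;

    @Override
    public void open(Configuration parameters) {
        ListStateDescriptor<String> descriptor =
                new ListStateDescriptor<>("event-buffer", String.class);
        buffer = getRuntimeContext().getListState(descriptor);
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        buffer.add(value);
        if (value.endsWith("FLUSH")) {
            for (String buffered : buffer.get()) {
                out.collect(buffered);
            }
            // Without this call the list state for the current key stays in
            // the state backend indefinitely and the heap keeps growing.
            buffer.clear();
        }
    }
}

Note that such a function has to run on a KeyedStream (i.e. after a keyBy), and clear() only removes the state of the key that is currently being processed.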

Best,

Ufuk

On Tue, Nov 7, 2017 at 12:43 PM, ebru <b20926247@cs.hacettepe.edu.tr> wrote:


Begin forwarded message:

From: ebru <b20926247@cs.hacettepe.edu.tr>
Subject: Re: Flink memory leak
Date: 7 November 2017 at 14:09:17 GMT+3
To: Ufuk Celebi <uce@apache.org>

Hi Ufuk,

There are three snapshots of htop output.
1. snapshot is the initial state.
2. snapshot is after one job was submitted.
3. snapshot is the output of the one job with 15000 EPS, and the memory usage is always increasing over time.




<1.png><2.png><3.png>
On 7 Nov 2017, at 13:34, Ufuk Celebi <uce@apache.org> wrote:

Hey Ebru,

let me pull in Aljoscha (CC'd) who might have an idea what's causing this.

Since multiple jobs are running, it will be hard to tell
which job the state descriptors from the heap snapshot belong to.
- Is it possible to isolate the problem and reproduce the behaviour
with only a single job? (A stripped-down skeleton for that is sketched below.)
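For reference, such a single-job skeleton could look like the following (the source, host and port are placeholders, not the actual pipeline):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical minimal job: one source, one filter, one map, parallelism 1.
// Submitting only this job to an otherwise idle cluster makes it easier to
// see whether a single pipeline already shows the growing heap usage.
public class SingleJobRepro {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        env.socketTextStream("localhost", 9999)   // placeholder source
            .filter(line -> !line.isEmpty())
            .map(new MapFunction<String, String>() {
                @Override
                public String map(String line) {
                    return line.trim();
                }
            })
            .print();

        env.execute("single-job-memory-repro");
    }
}

Watching the TaskManager heap before, during and after cancelling this one job should show whether memory is actually released.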

– Ufuk


On Tue, Nov 7, 2017 at 10:27 AM, ÇETİNKAYA EBRU ÇETİNKAYA EBRU
<b20926247@cs.hacettepe.edu.tr> wrote:
Hi,

We are using Flink 1.3.1 in production; we have one job manager and 3 task
managers in standalone mode. Recently, we've noticed that we have memory
related problems. We use Docker containers to serve the Flink cluster. We have
300 slots, and 20 jobs are running with a parallelism of 10. Also, the job count
may change over time. TaskManager memory usage always increases. After
job cancellation this memory usage doesn't decrease. We've tried to
investigate the problem and we've taken a task manager JVM heap snapshot.
According to the JVM heap analysis, a possible memory leak was the Flink list
state descriptor. But we are not sure that is the cause of our memory
problem. How can we solve the problem?



