From: Stephan Ewen
Date: Fri, 14 Jul 2017 18:31:27 +0200
Subject: Re: S3 recovery and checkpoint directories exhibit explosive growth
To: prashantnayak
Cc: user, Stefan Richter, 施晓罡

Hi!

I am looping in Stefan and Xiaogang, who worked a lot on incremental checkpointing.

Some background on incremental checkpoints: incremental checkpoints store "pieces" of the state (RocksDB SSTables) that are shared between checkpoints. Hence they naturally use more files than non-incremental checkpoints.

You could help us understand this with a few more details:
  - Does it only occur with incremental checkpoints, or also with regular checkpoints?
  - How many checkpoints do you retain?
  - Do you use externalized checkpoints?
  - Do you use a highly-available setup with ZooKeeper?

Thanks,
Stephan
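For anyone following along, here is a minimal sketch of how these options are typically wired up with the Flink 1.3 Java API, assuming the RocksDB state backend (the S3 URI and the checkpoint interval are placeholders, not values from this thread):

  import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
  import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class CheckpointSetup {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          // Take a checkpoint every 60 seconds (placeholder interval).
          env.enableCheckpointing(60_000);

          // RocksDB state backend; the second constructor argument enables incremental
          // checkpoints, which share RocksDB SSTable files across successive checkpoints.
          env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true));

          // Externalized checkpoints: keep the checkpoint metadata when the job is
          // cancelled, so the job can later be resumed from it.
          env.getCheckpointConfig().enableExternalizedCheckpoints(
                  ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

          // ... define sources, transformations, and sinks, then env.execute(...) ...
      }
  }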
On Thu, Jul 13, 2017 at 10:43 PM, prashantnayak <prashant@intellifylearning.com> wrote:

> To add one more data point... it seems like the recovery directory is the
> bottleneck somehow, so if we delete the recovery directory and restart the
> job manager, it comes back and is responsive.
>
> Of course, we lose all jobs, since none can be recovered... and that is of
> course not ideal.
>
> So the question seems to be why the recovery directory grows exponentially
> in the first place.
>
> I can't imagine we're the only ones to see this... or we must be configuring
> something wrong while testing Flink 1.3.1.
>
> Thanks for your help in advance.
>
> Prashant
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14271.html
> Sent from the Apache Flink User Mailing List archive mailing list archive at Nabble.com.
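Since the quoted report describes a growing "recovery" directory, which is presumably the ZooKeeper HA storage directory, here is a rough flink-conf.yaml sketch of such a setup with a bounded number of retained checkpoints (the quorum address, S3 paths, and retention count are placeholders, not values from this thread):

  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
  high-availability.storageDir: s3://my-bucket/flink/recovery
  state.backend: rocksdb
  state.checkpoints.dir: s3://my-bucket/flink/checkpoints
  state.checkpoints.num-retained: 1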