From: Stephan Ewen
Date: Mon, 24 Jul 2017 19:56:21 +0200
Subject: Re: S3 recovery and checkpoint directories exhibit explosive growth
To: user@flink.apache.org
Cc: prashantnayak, Stefan Richter, Xiaogang Shi

Hi Prashant!

I assume you are using Flink 1.3.0 or 1.3.1?

Here are some things you can do:

  - I would try disabling incremental checkpointing for a start and see
    what happens then. That alone should reduce the number of files.
    (A sketch of how to do this follows after the list.)

  - Is it possible for you to run a patched version of Flink? If yes, can
    you try the following: in the class "FileStateHandle", in the method
    "discardState()", remove the code around
    "FileUtils.deletePathIfEmpty(...)" - that check probably does not
    work well when hitting too many S3 files. (See the second sketch
    below.)

  - You can delete old "completedCheckpointXXXYYY" files, but please do
    not delete the other two types; they are needed for HA recovery.
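For the first point, here is a minimal sketch of what disabling
incremental checkpoints looks like when the backend is configured in
code. The class name, the S3 URI, and the checkpoint interval below are
placeholders for your own setup:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class NonIncrementalCheckpointJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // The second constructor argument is
            // "enableIncrementalCheckpointing". Passing 'false' makes
            // every checkpoint a full, self-contained snapshot, so far
            // fewer small files accumulate on S3.
            env.setStateBackend(
                new RocksDBStateBackend("s3://<bucket>/checkpoints", false));

            // Checkpoint every 120s, matching your current interval.
            env.enableCheckpointing(120_000);

            // ... define sources, transformations, sinks ...
            env.execute("non-incremental-checkpoint-job");
        }
    }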
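For the second point, the patched method would look roughly like this. I
am writing this from memory of the 1.3 sources, so treat it as an
illustration of where to cut, not as an exact diff:

    // In org.apache.flink.runtime.state.filesystem.FileStateHandle
    @Override
    public void discardState() throws Exception {
        FileSystem fs = getFileSystem();

        // Deleting the state file itself stays as before.
        fs.delete(filePath, false);

        // Removed: FileUtils.deletePathIfEmpty(fs, filePath.getParent());
        // That "is the parent directory empty?" check turns into a key
        // listing on S3, which becomes very expensive once a checkpoint
        // directory holds a large number of files.
    }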
Greetings,
Stephan


On Mon, Jul 24, 2017 at 3:46 AM, prashantnayak
<prashant@intellifylearning.com> wrote:

> Hi Xiaogang and Stephan
>
> We're continuing to test and have now set up the cluster to disable
> incremental RocksDB checkpointing as well as increasing the checkpoint
> interval from 30s to 120s (not ideal, really :-( )
>
> We'll run it with a large number of jobs and report back if this setup
> shows improvement.
>
> Appreciate any other insights you might have around this problem.
>
> Thanks
> Prashant
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14392.html
> Sent from the Apache Flink User Mailing List archive mailing list
> archive at Nabble.com.