From: Stephan Ewen
Date: Mon, 24 Jul 2017 19:56:21 +0200
Subject: Re: S3 recovery and checkpoint directories exhibit explosive growth
To: user@flink.apache.org
Cc: prashantnayak, Stefan Richter, Xiaogang Shi

Hi Prashant!

I assume you are using Flink 1.3.0 or 1.3.1?

Here are some things you can do:

  - I would try disabling incremental checkpointing for a start and see
    what happens then. That alone should reduce the number of files.
    (A sketch of how to do this follows after the list.)

  - Is it possible for you to run a patched version of Flink? If yes, can
    you try the following: in the class "FileStateHandle", in the method
    "discardState()", remove the code around
    "FileUtils.deletePathIfEmpty(...)" - that check probably does not
    work well when hitting too many S3 files. (See the second sketch
    below.)

  - You can delete old "completedCheckpointXXXYYY" files, but please do
    not delete the other two types; they are needed for HA recovery.
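For the first point, here is a minimal sketch of what disabling
incremental checkpoints looks like when the backend is configured in
code. The class name, the S3 URI, and the checkpoint interval below are
placeholders for your own setup:

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class NonIncrementalCheckpointJob {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // The second constructor argument is
            // "enableIncrementalCheckpointing". Passing 'false' makes
            // every checkpoint a full, self-contained snapshot, so far
            // fewer small files accumulate on S3.
            env.setStateBackend(
                new RocksDBStateBackend("s3://<bucket>/checkpoints", false));

            // Checkpoint every 120s, matching your current interval.
            env.enableCheckpointing(120_000);

            // ... define sources, transformations, sinks ...
            env.execute("non-incremental-checkpoint-job");
        }
    }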
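For the second point, the patched method would look roughly like this. I
am writing this from memory of the 1.3 sources, so treat it as an
illustration of where to cut, not as an exact diff:

    // In org.apache.flink.runtime.state.filesystem.FileStateHandle
    @Override
    public void discardState() throws Exception {
        FileSystem fs = getFileSystem();

        // Deleting the state file itself stays as before.
        fs.delete(filePath, false);

        // Removed: FileUtils.deletePathIfEmpty(fs, filePath.getParent());
        // That "is the parent directory empty?" check turns into a key
        // listing on S3, which becomes very expensive once a checkpoint
        // directory holds a large number of files.
    }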
Greetings,
Stephan


On Mon, Jul 24, 2017 at 3:46 AM, prashantnayak
<prashant@intellifylearning.com> wrote:

> Hi Xiaogang and Stephan
>
> We're continuing to test and have now set up the cluster to disable
> incremental RocksDB checkpointing as well as increasing the checkpoint
> interval from 30s to 120s (not ideal, really :-( )
>
> We'll run it with a large number of jobs and report back if this setup
> shows improvement.
>
> Appreciate any other insights you might have around this problem.
>
> Thanks
> Prashant
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14392.html
> Sent from the Apache Flink User Mailing List archive mailing list
> archive at Nabble.com.