From: Stephan Ewen
Date: Fri, 14 Jul 2017 18:31:27 +0200
Subject: Re: S3 recovery and checkpoint directories exhibit explosive growth
To: prashantnayak
Cc: user, Stefan Richter, 施晓罡

Hi!

I am looping in Stefan and Xiaogang, who worked a lot on incremental checkpointing.

Some background on incremental checkpoints: incremental checkpoints store "pieces" of the state (RocksDB SSTables) that are shared between checkpoints. Hence they naturally use more files than non-incremental checkpoints.

You could help us understand this with a few more details:
  - Does it only occur with incremental checkpoints, or also with regular checkpoints?
  - How many checkpoints do you retain?
  - Do you use externalized checkpoints?
  - Do you use a highly-available setup with ZooKeeper?

Thanks,
Stephan
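For anyone following along, here is a minimal sketch of how these options are typically wired up with the Flink 1.3 Java API, assuming the RocksDB state backend (the S3 URI and the checkpoint interval are placeholders, not values from this thread):

  import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
  import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class CheckpointSetup {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          // Take a checkpoint every 60 seconds (placeholder interval).
          env.enableCheckpointing(60_000);

          // RocksDB state backend; the second constructor argument enables incremental
          // checkpoints, which share RocksDB SSTable files across successive checkpoints.
          env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true));

          // Externalized checkpoints: keep the checkpoint metadata when the job is
          // cancelled, so the job can later be resumed from it.
          env.getCheckpointConfig().enableExternalizedCheckpoints(
                  ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

          // ... define sources, transformations, and sinks, then env.execute(...) ...
      }
  }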
On Thu, Jul 13, 2017 at 10:43 PM, prashantnayak <prashant@intellifylearning.com> wrote:

> To add one more data point... it seems like the recovery directory is the
> bottleneck somehow, so if we delete the recovery directory and restart the
> job manager, it comes back and is responsive.
>
> Of course, we lose all jobs, since none can be recovered... and that is of
> course not ideal.
>
> So the question seems to be why the recovery directory grows exponentially
> in the first place.
>
> I can't imagine we're the only ones to see this... or we must be configuring
> something wrong while testing Flink 1.3.1.
>
> Thanks for your help in advance.
>
> Prashant
>
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14271.html
> Sent from the Apache Flink User Mailing List archive mailing list archive at Nabble.com.
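Since the quoted report describes a growing "recovery" directory, which is presumably the ZooKeeper HA storage directory, here is a rough flink-conf.yaml sketch of such a setup with a bounded number of retained checkpoints (the quorum address, S3 paths, and retention count are placeholders, not values from this thread):

  high-availability: zookeeper
  high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
  high-availability.storageDir: s3://my-bucket/flink/recovery
  state.backend: rocksdb
  state.checkpoints.dir: s3://my-bucket/flink/checkpoints
  state.checkpoints.num-retained: 1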