From: Clay Teeter <clay.teeter@maalka.com>
Date: Thu, 26 Sep 2019 20:09:16 +0200
Subject: Re: Flink job manager doesn't remove stale checkmarks
To: Fabian Hueske
Cc: Biao Liu, user@flink.apache.org
I see, I'll try turning off incremental checkpoints to see if that helps.

Re: disk space, I could see a scenario with my application where I could get 10,000+ checkpoints, if the checkpoints are additive. I'll let you know what I see.

Thanks!
Clay

On Wed, Sep 25, 2019 at 5:40 PM Fabian Hueske wrote:

> Hi,
>
> You enabled incremental checkpoints.
> This means that parts of older checkpoints that did not change since the
> last checkpoint are not removed, because they are still referenced by the
> incremental checkpoints.
> Flink will automatically remove them once they are no longer needed.
>
> Are you sure that the size of your application's state is not growing too
> large?
>
> Best, Fabian
>
> Am Di., 24. Sept. 2019 um 10:47 Uhr schrieb Clay Teeter <
> clay.teeter@maalka.com>:
>
>> Oh geez, checkmarks = checkpoints... sorry.
>>
>> What I mean by stale "checkpoints" are checkpoints that should be reaped
>> by "state.checkpoints.num-retained: 3".
>>
>> What is happening is that the directories
>>  - state.checkpoints.dir: file:///opt/ha/49/checkpoints
>>  - high-availability.storageDir: file:///opt/ha/49/ha
>> are growing with every checkpoint, and I'm running out of disk space.
>>
>> On Tue, Sep 24, 2019 at 4:55 AM Biao Liu wrote:
>>
>>> Hi Clay,
>>>
>>> Sorry, I don't get your point. I'm not sure what "stale checkmarks"
>>> means exactly. The HA storage and checkpoint directory left after shutting
>>> down the cluster?
>>>
>>> Thanks,
>>> Biao /'bɪ.aʊ/
>>>
>>> On Tue, 24 Sep 2019 at 03:12, Clay Teeter wrote:
>>>
>>>> I'm trying to get my standalone cluster to remove stale checkmarks.
>>>>
>>>> The cluster is composed of a single job manager and task manager,
>>>> backed by RocksDB, with high availability.
>>>>
>>>> The configuration on both the job and task manager is:
>>>>
>>>> state.backend: rocksdb
>>>> state.checkpoints.dir: file:///opt/ha/49/checkpoints
>>>> state.backend.incremental: true
>>>> state.checkpoints.num-retained: 3
>>>> jobmanager.heap.size: 1024m
>>>> taskmanager.heap.size: 2048m
>>>> taskmanager.numberOfTaskSlots: 24
>>>> parallelism.default: 1
>>>> high-availability.jobmanager.port: 6123
>>>> high-availability.zookeeper.path.root: ********_49
>>>> high-availability: zookeeper
>>>> high-availability.storageDir: file:///opt/ha/49/ha
>>>> high-availability.zookeeper.quorum: ******t:2181
>>>>
>>>> Both machines have access to /opt/ha/49 and /opt/ha/49/checkpoints via
>>>> NFS, and both are owned by the flink user. Also, there are no errors
>>>> that I can find.
>>>>
>>>> Does anyone have any ideas that I could try?
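Fabian's point about why the checkpoint directory keeps growing can be illustrated with a small model (plain Python, not Flink code; the file names and the reference-tracking scheme are made up for illustration): with incremental checkpoints, a state file written for an early checkpoint must stay on disk for as long as any of the `state.checkpoints.num-retained` newest checkpoints still references it.

```python
# Toy model of incremental checkpoint retention (NOT Flink internals).
# Each checkpoint is (id, set of shared state files it references).

def retained_files(checkpoints, num_retained):
    """Return the files that must stay on disk, given that only the
    last `num_retained` checkpoints are kept (oldest first in the list)."""
    needed = set()
    for _, files in checkpoints[-num_retained:]:
        needed |= files
    return needed

# Checkpoint 1 wrote a.sst; later checkpoints reuse it and add deltas.
history = [
    (1, {"a.sst"}),
    (2, {"a.sst", "b.sst"}),
    (3, {"a.sst", "c.sst"}),
    (4, {"a.sst", "d.sst"}),
]

# With num-retained = 3, a.sst is still needed even though checkpoint 1,
# which originally wrote it, has long been discarded.
print(retained_files(history, 3))
```

So a directory that keeps growing is expected if the live state keeps referencing more files; genuinely unreferenced files, however, should be cleaned up by Flink automatically.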
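For reference, the change Clay mentions trying (turning off incremental checkpoints) is a one-line edit to the config above; with incremental checkpointing disabled, each retained checkpoint is a self-contained snapshot, so only the `state.checkpoints.num-retained` newest checkpoint directories need to stay on disk. A minimal sketch of the changed flink-conf.yaml fragment:

```yaml
# flink-conf.yaml on both job and task manager.
# Full (non-incremental) checkpoints: larger individually, but old
# checkpoint files are no longer shared across checkpoints, so the
# retained-checkpoint limit bounds disk usage directly.
state.backend: rocksdb
state.backend.incremental: false
state.checkpoints.num-retained: 3
```

The trade-off is larger and potentially slower checkpoints, since the full RocksDB state is uploaded each time instead of only the changed files.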