From: Clay Teeter <clay.teeter@maalka.com>
Date: Thu, 26 Sep 2019 20:09:16 +0200
Subject: Re: Flink job manager doesn't remove stale checkmarks
To: Fabian Hueske
Cc: Biao Liu, user@flink.apache.org
I see, I'll try turning off incremental checkpoints to see if that helps.

Re: disk space, I could see a scenario with my application where I could get 10,000+ checkpoints, if the checkpoints are additive. I'll let you know what I see.

Thanks!
Clay

On Wed, Sep 25, 2019 at 5:40 PM Fabian Hueske wrote:

> Hi,
>
> You enabled incremental checkpoints.
> This means that parts of older checkpoints that did not change since the
> last checkpoint are not removed, because they are still referenced by the
> incremental checkpoints.
> Flink will automatically remove them once they are no longer needed.
>
> Are you sure that the size of your application's state is not growing too
> large?
>
> Best, Fabian
>
> Am Di., 24. Sept. 2019 um 10:47 Uhr schrieb Clay Teeter <
> clay.teeter@maalka.com>:
>
>> Oh geez, checkmarks = checkpoints... sorry.
>>
>> What I mean by stale "checkpoints" are checkpoints that should be reaped
>> by "state.checkpoints.num-retained: 3".
>>
>> What is happening is that the directories
>>  - state.checkpoints.dir: file:///opt/ha/49/checkpoints
>>  - high-availability.storageDir: file:///opt/ha/49/ha
>> are growing with every checkpoint, and I'm running out of disk space.
>>
>> On Tue, Sep 24, 2019 at 4:55 AM Biao Liu wrote:
>>
>>> Hi Clay,
>>>
>>> Sorry, I don't get your point. I'm not sure what "stale checkmarks"
>>> means exactly. The HA storage and checkpoint directory left after shutting
>>> down the cluster?
>>>
>>> Thanks,
>>> Biao /'bɪ.aʊ/
>>>
>>> On Tue, 24 Sep 2019 at 03:12, Clay Teeter wrote:
>>>
>>>> I'm trying to get my standalone cluster to remove stale checkmarks.
>>>>
>>>> The cluster is composed of a single job manager and task manager,
>>>> backed by RocksDB, with high availability.
>>>>
>>>> The configuration on both the job and task manager is:
>>>>
>>>> state.backend: rocksdb
>>>> state.checkpoints.dir: file:///opt/ha/49/checkpoints
>>>> state.backend.incremental: true
>>>> state.checkpoints.num-retained: 3
>>>> jobmanager.heap.size: 1024m
>>>> taskmanager.heap.size: 2048m
>>>> taskmanager.numberOfTaskSlots: 24
>>>> parallelism.default: 1
>>>> high-availability.jobmanager.port: 6123
>>>> high-availability.zookeeper.path.root: ********_49
>>>> high-availability: zookeeper
>>>> high-availability.storageDir: file:///opt/ha/49/ha
>>>> high-availability.zookeeper.quorum: ******t:2181
>>>>
>>>> Both machines have access to /opt/ha/49 and /opt/ha/49/checkpoints via
>>>> NFS, and both are owned by the flink user. Also, there are no errors
>>>> that I can find.
>>>>
>>>> Does anyone have any ideas that I could try?
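Fabian's point about why the checkpoint directory keeps growing can be illustrated with a small model (plain Python, not Flink code; the file names and the reference-tracking scheme are made up for illustration): with incremental checkpoints, a state file written for an early checkpoint must stay on disk for as long as any of the `state.checkpoints.num-retained` newest checkpoints still references it.

```python
# Toy model of incremental checkpoint retention (NOT Flink internals).
# Each checkpoint is (id, set of shared state files it references).

def retained_files(checkpoints, num_retained):
    """Return the files that must stay on disk, given that only the
    last `num_retained` checkpoints are kept (oldest first in the list)."""
    needed = set()
    for _, files in checkpoints[-num_retained:]:
        needed |= files
    return needed

# Checkpoint 1 wrote a.sst; later checkpoints reuse it and add deltas.
history = [
    (1, {"a.sst"}),
    (2, {"a.sst", "b.sst"}),
    (3, {"a.sst", "c.sst"}),
    (4, {"a.sst", "d.sst"}),
]

# With num-retained = 3, a.sst is still needed even though checkpoint 1,
# which originally wrote it, has long been discarded.
print(retained_files(history, 3))
```

So a directory that keeps growing is expected if the live state keeps referencing more files; genuinely unreferenced files, however, should be cleaned up by Flink automatically.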
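For reference, the change Clay mentions trying (turning off incremental checkpoints) is a one-line edit to the config above; with incremental checkpointing disabled, each retained checkpoint is a self-contained snapshot, so only the `state.checkpoints.num-retained` newest checkpoint directories need to stay on disk. A minimal sketch of the changed flink-conf.yaml fragment:

```yaml
# flink-conf.yaml on both job and task manager.
# Full (non-incremental) checkpoints: larger individually, but old
# checkpoint files are no longer shared across checkpoints, so the
# retained-checkpoint limit bounds disk usage directly.
state.backend: rocksdb
state.backend.incremental: false
state.checkpoints.num-retained: 3
```

The trade-off is larger and potentially slower checkpoints, since the full RocksDB state is uploaded each time instead of only the changed files.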