From: Ufuk Celebi
Date: Thu, 12 May 2016 18:39:12 +0200
Subject: Re: checkpoints not being removed from HDFS
To: Ufuk Celebi
Cc: user@flink.apache.org

The issue is tracked here: https://issues.apache.org/jira/browse/FLINK-3902

(My earlier "explanation" doesn't actually make sense, and I don't see a
reason why this should be related to having many state handles.)

On Thu, May 12, 2016 at 3:54 PM, Ufuk Celebi wrote:
> Hey Maciek,
>
> thanks for reporting this. Having files linger around looks like a bug to me.
>
> The idea behind setting the recursive flag to false in the
> AbstractFileStateHandle.discardState() call is that the
> FileStateHandle is actually just a single file, not a directory.
> The second call, which tries to delete the parent directory, only
> succeeds once all other files in that directory have been deleted as
> well. I think this is what sometimes fails with many state handles.
> For RocksDB there is only a single state handle, which works well.
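>
> Roughly, the discard path follows this pattern (a simplified sketch,
> not the actual Flink code - the class name and the direct use of
> Hadoop's FileSystem API here are assumptions):
>
> import java.io.IOException;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> class StateFileDiscard {
>     // Delete the single state file, then make a best-effort,
>     // non-recursive attempt to remove the parent checkpoint directory.
>     static void discardState(FileSystem fs, Path stateFile) throws IOException {
>         // The handle points to one file, so a non-recursive delete
>         // is enough for the file itself.
>         fs.delete(stateFile, false);
>
>         try {
>             // Succeeds only once all other state files in the
>             // directory have been deleted as well.
>             fs.delete(stateFile.getParent(), false);
>         } catch (IOException e) {
>             // The directory still holds other state files.
>         }
>     }
> }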
>
> I will open an issue for this, try to reproduce it reliably, and then fix it.
>
> – Ufuk
>
>
> On Thu, May 12, 2016 at 10:28 AM, Maciek Próchniak wrote:
>> Hi,
>>
>> we have a stream job with quite large state (a few GB), we're using
>> FSStateBackend and we're storing checkpoints in HDFS.
>> What we observe is that very often old checkpoints are not discarded
>> properly. In the Hadoop logs I can see:
>>
>> 2016-05-10 12:21:06,559 INFO BlockStateChange: BLOCK* addToInvalidates:
>> blk_1084791727_11053122 10.10.113.10:50010
>> 2016-05-10 12:21:06,559 INFO org.apache.hadoop.ipc.Server: IPC Server
>> handler 9 on 8020, call
>> org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 10.10.113.9:49233
>> Call#12337 Retry#0
>> org.apache.hadoop.fs.PathIsNotEmptyDirectoryException:
>> `/flink/checkpoints_test/570d6e67d571c109daab468e5678402b/chk-62 is non
>> empty': Directory is not empty
>>     at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:85)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3712)
>>
>> While on the Flink side (jobmanager log) we don't see any problems:
>>
>> 2016-05-10 12:20:22,636 [Checkpoint Timer] INFO
>>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 62 @ 1462875622636
>> 2016-05-10 12:20:32,507 [flink-akka.actor.default-dispatcher-240088] INFO
>>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 62 (in 9843 ms)
>> 2016-05-10 12:20:52,637 [Checkpoint Timer] INFO
>>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 63 @ 1462875652637
>> 2016-05-10 12:21:06,563 [flink-akka.actor.default-dispatcher-240028] INFO
>>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 63 (in 13909 ms)
>> 2016-05-10 12:21:22,636 [Checkpoint Timer] INFO
>>   org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 64 @ 1462875682636
>>
>> I see in the code that delete operations in Flink are done with the
>> recursive flag set to false - but I'm not sure why the directory
>> contents are not deleted first?
>> When we were using the RocksDB backend we didn't encounter this
>> situation. We're using Flink 1.0.1 and HDFS 2.7.2.
>>
>> Does anybody have any idea why this could be happening?
>>
>> thanks,
>> maciek
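>>
>> P.S. the namenode exception above is easy to reproduce in isolation -
>> a rough sketch, the paths and names below are made up:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>>
>> public class NonRecursiveDeleteRepro {
>>     public static void main(String[] args) throws Exception {
>>         // Assumes fs.defaultFS points at the HDFS namenode.
>>         FileSystem fs = FileSystem.get(new Configuration());
>>
>>         Path dir = new Path("/tmp/chk-repro");           // hypothetical path
>>         fs.mkdirs(dir);
>>         fs.create(new Path(dir, "state-file")).close();  // leave one file behind
>>
>>         // recursive = false on a non-empty directory: HDFS rejects
>>         // this with PathIsNotEmptyDirectoryException, exactly as in
>>         // the namenode log above.
>>         fs.delete(dir, false);
>>     }
>> }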