Subject: Re: checkpoints not being removed from HDFS
To: user@flink.apache.org, Ufuk Celebi
From: Maciek Próchniak
Date: Thu, 12 May 2016 21:28:30 +0200

thanks, I'll try to reproduce it in some test by myself...

maciek

On 12/05/2016 18:39, Ufuk Celebi wrote:
> The issue is here: https://issues.apache.org/jira/browse/FLINK-3902
>
> (My "explanation" before doesn't actually make sense, and I don't see a
> reason why this should be related to having many state handles.)
>
> On Thu, May 12, 2016 at 3:54 PM, Ufuk Celebi wrote:
>> Hey Maciek,
>>
>> thanks for reporting this. Having files linger around looks like a bug to me.
>>
>> The idea behind having the recursive flag set to false in the
>> AbstractFileStateHandle.discardState() call is that the
>> FileStateHandle is actually just a single file and not a directory.
>> The second call, which tries to delete the parent directory, only succeeds
>> when all other files in that directory have been deleted as well. I
>> think this is what sometimes fails with many state handles. For
>> RocksDB there is only a single state handle, which works well.
>>
>> I will open an issue for this, try to reproduce it reliably, and then fix it.
>>
>> – Ufuk
>>
>>
>> On Thu, May 12, 2016 at 10:28 AM, Maciek Próchniak wrote:
>>> Hi,
>>>
>>> we have a stream job with quite a large state (a few GB); we're using
>>> the FSStateBackend and we're storing checkpoints in HDFS.
>>> What we observe is that very often old checkpoints are not discarded properly.
>>> In the Hadoop logs I can see:
>>>
>>> 2016-05-10 12:21:06,559 INFO BlockStateChange: BLOCK* addToInvalidates:
>>> blk_1084791727_11053122 10.10.113.10:50010
>>> 2016-05-10 12:21:06,559 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 9 on 8020, call
>>> org.apache.hadoop.hdfs.protocol.ClientProtocol.delete from 10.10.113.9:49233
>>> Call#12337 Retry#0
>>> org.apache.hadoop.fs.PathIsNotEmptyDirectoryException:
>>> `/flink/checkpoints_test/570d6e67d571c109daab468e5678402b/chk-62 is non
>>> empty': Directory is not empty
>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirDeleteOp.delete(FSDirDeleteOp.java:85)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:3712)
>>>
>>> While on the Flink side (jobmanager log) we don't see any problems:
>>>
>>> 2016-05-10 12:20:22,636 [Checkpoint Timer] INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering
>>> checkpoint 62 @ 1462875622636
>>> 2016-05-10 12:20:32,507 [flink-akka.actor.default-dispatcher-240088] INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
>>> checkpoint 62 (in 9843 ms)
>>> 2016-05-10 12:20:52,637 [Checkpoint Timer] INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering
>>> checkpoint 63 @ 1462875652637
>>> 2016-05-10 12:21:06,563 [flink-akka.actor.default-dispatcher-240028] INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed
>>> checkpoint 63 (in 13909 ms)
>>> 2016-05-10 12:21:22,636 [Checkpoint Timer] INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering
>>> checkpoint 64 @ 1462875682636
>>>
>>> I see in the code that delete operations in Flink are done with the recursive
>>> flag set to false - but I'm not sure why the contents are not being deleted
>>> beforehand.
>>> When we were using the RocksDB backend we didn't encounter this situation.
>>> We're using Flink 1.0.1 and HDFS 2.7.2.
>>>
>>> Does anybody have any idea why this could be happening?
>>>
>>> thanks,
>>> maciek
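For anyone following the mechanics discussed above, here is a minimal sketch of
the two-step delete pattern Ufuk describes, written against the plain Hadoop
FileSystem API. This is an illustration only, not the actual Flink
FileStateHandle code; the class name CheckpointCleanupSketch and the example
paths are hypothetical.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckpointCleanupSketch {

    // Sketch of the cleanup pattern: the state handle refers to a single
    // file, so it is deleted non-recursively, and then a non-recursive
    // delete of the enclosing chk-XX directory is attempted. That second
    // delete can only succeed once every other state file in the directory
    // has been removed; otherwise the HDFS namenode rejects it with
    // PathIsNotEmptyDirectoryException, matching the namenode log quoted above.
    public static void discardStateFile(FileSystem fs, Path stateFile) throws IOException {
        // Step 1: remove the single state file.
        fs.delete(stateFile, false);

        // Step 2: best-effort removal of the parent checkpoint directory.
        try {
            fs.delete(stateFile.getParent(), false);
        } catch (IOException e) {
            // Expected while sibling state files still exist; the last
            // handle to be discarded should remove the directory.
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path, mirroring the layout from the logs above.
        FileSystem fs = FileSystem.get(new Configuration());
        Path stateFile = new Path(
                "/flink/checkpoints_test/570d6e67d571c109daab468e5678402b/chk-62/state-handle-file");
        discardStateFile(fs, stateFile);
    }
}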