Date: Tue, 14 Jul 2015 15:39:11 +0000 (UTC)
From: "Ufuk Celebi (JIRA)"
To: dev@flink.apache.org
Subject: [jira] [Created] (FLINK-2356) Resource leak in checkpoint coordinator

Ufuk Celebi created FLINK-2356:
----------------------------------

             Summary: Resource leak in checkpoint coordinator
                 Key: FLINK-2356
                 URL: https://issues.apache.org/jira/browse/FLINK-2356
             Project: Flink
          Issue Type: Bug
          Components: JobManager, Streaming
    Affects Versions: 0.9, master
            Reporter: Ufuk Celebi
             Fix For: 0.10, 0.9.1

The shutdown method of the checkpoint coordinator is not called when a Flink cluster is shut down via SIGINT. The issue is that the checkpoint coordinator's shutdown/cleanup is only called after the job enters a final state. This does not happen for a regular cluster shutdown (via kill).
Because we don't have proper stopping of streaming jobs, every program that uses checkpointing suffers from this.

I've tested this only locally so far, with a custom WordCount that checkpoints the current count. After stopping the process, the files still exist. Since this is the same mechanism as in a distributed setup with HDFS, this should mean that files will linger in HDFS.

The problem is that the postStop method of the JM actor is not called when shutting down. The task manager components that need to do resource cleanup register custom shutdown hooks and don't rely on a shutdown call from the task manager.

For 0.9.1 we need to make sure that the state is simply cleaned up with a shutdown hook (as in the blob manager). For 0.10 with HA we need to be more careful and not clean it up when other job manager instances need access. See FLINK-2354 for details.
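
To illustrate the shutdown-hook approach proposed for 0.9.1, here is a minimal sketch. It assumes the checkpoint state lives in a local directory. `CheckpointCleanupHook`, `registerCleanupHook`, and `deleteRecursively` are hypothetical names made up for this example and are not Flink APIs; the only real API used is the JDK's `Runtime.addShutdownHook`.

```java
import java.io.File;

/**
 * Hypothetical helper (not part of Flink): removes a local checkpoint
 * directory when the JVM shuts down, independent of any actor callback.
 */
public class CheckpointCleanupHook {

    /** Deletes a file or directory tree; returns true if nothing is left afterwards. */
    static boolean deleteRecursively(File file) {
        File[] children = file.listFiles(); // null for regular files or on I/O error
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        return file.delete() || !file.exists();
    }

    /** Registers a JVM shutdown hook that deletes the directory; returns the hook thread. */
    static Thread registerCleanupHook(final File checkpointDir) {
        Thread hook = new Thread(
                () -> deleteRecursively(checkpointDir), "checkpoint-cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed layout for the demo: one sub-directory per checkpoint.
        File dir = new File(System.getProperty("java.io.tmpdir"),
                "checkpoints-" + System.nanoTime());
        new File(dir, "chk-1").mkdirs();
        registerCleanupHook(dir);
        System.out.println("Checkpoint dir: " + dir + " (press Ctrl-C to trigger cleanup)");
        Thread.sleep(60_000);
    }
}
```

JVM shutdown hooks run on SIGINT and SIGTERM as well as on normal exit, but not on SIGKILL. As noted above, an unconditional delete like this would not be safe for 0.10 with HA, where other job manager instances may still need the state.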