From: shravan
To: user@flink.apache.org
Date: Fri, 13 Mar 2020 04:10:03 -0500 (CDT)
Message-ID: <1584090603946-0.post@n4.nabble.com>
Subject: Stop job with savepoint during graceful shutdown on a k8s cluster

In our setup, the Job Manager and the Task Managers run as separate pods within a K8S cluster. Since we do not use a job cluster, the job jars are not part of the Job Manager docker image; the job is submitted from a separate Flink client pod. Flink is configured with the RocksDB state backend. We build the docker images ourselves because the base OS image has to comply with our organization's guidelines.

We are looking for a reliable way to stop the job with a savepoint during graceful shutdown, so that we avoid duplicates on restart. The Job Manager pod traps the shutdown signal and stops all jobs with savepoints. When the Flink client pod restarts, it starts the job from that savepoint.

However, since the order in which the pods are shut down is not predictable, we have the following questions:

1. Our understanding is that, when a job is stopped with a savepoint, every Task Manager persists its state as part of that savepoint. If a Task Manager receives a shutdown signal while the savepoint is being taken, does it complete the savepoint before shutting down?

2. The Job Manager K8S service is configured as the remote Job Manager address in the Task Managers. This service may not be available while the savepoint is being taken. Will that affect the communication between the Task Managers and the Job Manager during the savepoint?

Can you provide some pointers on the internals of savepoints in Flink?
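
For illustration, the "trap the shutdown signal and stop all jobs with savepoints" step described above could look roughly like the sketch below. It is not taken from our deployment. It assumes the Job Manager REST API as documented for Flink 1.9 and later (GET /jobs, POST /jobs/<jobid>/stop with "targetDirectory" and "drain"), so please check the endpoints against the Flink version in use. The base URL, the savepoint directory and all function names are hypothetical.

```python
import json
import signal
import urllib.request


def http_json(method, url, body=None):
    """Send a JSON request and return the decoded JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode())


def running_job_ids(base_url, call=http_json):
    """Return the ids of all jobs the Job Manager reports as RUNNING."""
    overview = call("GET", f"{base_url}/jobs")
    return [j["id"] for j in overview["jobs"] if j["status"] == "RUNNING"]


def stop_with_savepoint(base_url, job_id, target_dir, call=http_json):
    """Ask Flink to stop one job with a savepoint; return the trigger id."""
    reply = call("POST", f"{base_url}/jobs/{job_id}/stop",
                 {"targetDirectory": target_dir, "drain": False})
    return reply["request-id"]


def stop_all(base_url, target_dir, call=http_json):
    """Trigger stop-with-savepoint for every running job."""
    return {job_id: stop_with_savepoint(base_url, job_id, target_dir, call)
            for job_id in running_job_ids(base_url, call)}


def install_sigterm_hook(base_url, target_dir):
    """Register a SIGTERM handler that stops all running jobs.

    base_url and target_dir are placeholders chosen by the caller,
    e.g. the Job Manager REST address and a shared savepoint directory.
    """
    def on_term(signum, frame):
        print(stop_all(base_url, target_dir))
    signal.signal(signal.SIGTERM, on_term)
```

Note that the stop request is asynchronous: it only returns a trigger id. A real hook would still have to wait for the savepoint to complete (as far as I know, by polling GET /jobs/<jobid>/savepoints/<trigger-id>) before letting the pod exit, which is exactly where the shutdown ordering questions above come in.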