Date: Wed, 31 May 2017 12:38:11 +0000 (UTC)
From: "Robert Metzger (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Updated] (FLINK-5763) Make savepoints self-contained and relocatable

     [ https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Metzger updated FLINK-5763:
----------------------------------
    Fix Version/s:     (was: 1.3.0)
                   1.4.0

> Make savepoints self-contained and relocatable
> ----------------------------------------------
>
>                 Key: FLINK-5763
>                 URL: https://issues.apache.org/jira/browse/FLINK-5763
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>            Reporter: Ufuk Celebi
>             Fix For: 1.4.0
>
>
> After a user has triggered a savepoint, a single savepoint file will be returned as a handle to the savepoint. A savepoint to {{}} creates a savepoint file like {{/savepoint-}}.
> This file contains the metadata of the corresponding checkpoint, but not the actual program state. While this works well for short term management (pause-and-resume a job), it makes it hard to manage savepoints over longer periods of time.
> h4. Problems
> h5. Scattered Checkpoint Files
> For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this results in the savepoint referencing files from the checkpoint directory (usually different than ). For users, it is virtually impossible to tell which checkpoint files belong to a savepoint and which are lingering around. This can easily lead to accidentally invalidating a savepoint by deleting checkpoint files.
> h5. Savepoints Not Relocatable
> Even if a user is able to figure out which checkpoint files belong to a savepoint, moving these files will invalidate the savepoint as well, because the metadata file references absolute file paths.
> h5. Forced to Use CLI for Disposal
> Because of the scattered files, the user is in practice forced to use Flink’s CLI to dispose a savepoint. This should be possible to handle in the scope of the user’s environment via a file system delete operation.
> h4. Proposal
> In order to solve the described problems, savepoints should contain all their state, both metadata and program state, inside a single directory. Furthermore the metadata must only hold relative references to the checkpoint files. This makes it obvious which files make up the state of a savepoint and it is possible to move savepoints around by moving the savepoint directory.
> h5. Desired File Layout
> Triggering a savepoint to {{}} creates a directory as follows:
> {code}
> /savepoint--
>   +-- _metadata
>   +-- data- [1 or more]
> {code}
> We include the JobID in the savepoint directory name in order to give some hints about which job a savepoint belongs to.
> h5. CLI
> - Trigger: When triggering a savepoint to {{}} the savepoint directory will be returned as the handle to the savepoint.
> - Restore: Users can restore by pointing to the directory or the _metadata file. The data files should be required to be in the same directory as the _metadata file.
> - Dispose: The disposal command should be deprecated and eventually removed. While deprecated, disposal can happen by specifying the directory or the _metadata file (same as restore).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
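
The proposal in the issue above (relative references inside a single savepoint directory, restore from either the directory or the _metadata file, disposal by a plain file system delete) can be illustrated with a small sketch. This is not Flink code and does not reflect Flink's actual metadata format: the one-name-per-line `_metadata` content, the directory name `savepoint-demo`, and every function name below are assumptions made purely for illustration.

```python
import shutil
import tempfile
from pathlib import Path

METADATA_NAME = "_metadata"


def savepoint_dir_from_handle(handle: Path) -> Path:
    """Accept either the savepoint directory or its _metadata file as the handle."""
    return handle.parent if handle.name == METADATA_NAME else handle


def resolve_data_files(handle: Path) -> list[Path]:
    """Resolve relative references against the directory that holds _metadata.

    Toy format (an assumption): _metadata lists one data file name per line.
    """
    directory = savepoint_dir_from_handle(handle)
    names = (directory / METADATA_NAME).read_text().split()
    resolved = []
    for name in names:
        if Path(name).is_absolute():
            # An absolute path would tie the savepoint to its original location.
            raise ValueError(f"absolute reference is not relocatable: {name}")
        resolved.append(directory / name)
    return resolved


def write_toy_savepoint(directory: Path, data_names: list[str]) -> None:
    """Create a self-contained toy savepoint: data files plus _metadata."""
    directory.mkdir(parents=True)
    for name in data_names:
        (directory / name).write_bytes(b"state")
    (directory / METADATA_NAME).write_text("\n".join(data_names))


def demo() -> bool:
    """Move a toy savepoint, check it still resolves, then dispose of it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "target" / "savepoint-demo"
        write_toy_savepoint(original, ["data-0", "data-1"])

        # Relocate by moving the whole directory; no metadata rewrite is needed.
        moved = root / "archive" / "savepoint-demo"
        moved.parent.mkdir()
        shutil.move(str(original), str(moved))

        # Restore via the _metadata file; the directory itself would work too.
        still_valid = all(p.is_file() for p in resolve_data_files(moved / METADATA_NAME))

        # Disposal is an ordinary recursive delete of the directory.
        shutil.rmtree(moved)
        return still_valid and not moved.exists()
```

Calling `demo()` exercises the three operations in order. The point of the sketch is only that, with relative references, the directory is the unit of relocation and disposal; an absolute reference is rejected because it would reintroduce the problem described under "Savepoints Not Relocatable".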