From: jelmer
Date: Fri, 19 Jan 2018 00:26:42 +0100
Subject: Starting a job that does not use checkpointing from a savepoint is broken?
To: user@flink.apache.org

I ran into a rather annoying issue today while upgrading a Flink job from Flink 1.3.2 to 1.4.0.

This particular job uses neither checkpointing nor state.

I followed the instructions at https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/upgrading.html: first I created a savepoint, upgraded the cluster, then restarted the job from the savepoint.

This all went well until, a few hours later, one of our Kafka nodes died. This triggered an exception in the job, which was subsequently restarted. However, instead of picking up where it left off based on the offsets committed to Kafka (which is what should happen according to https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/kafka.html), the Kafka offsets were reset to the point when I made the savepoint three hours earlier, and so it started reprocessing millions of messages.

Needless to say, creating a savepoint for a job without state or checkpoints does not make much sense. But I would not expect a restart from a savepoint to completely break a job in the case of failure.

I created a repository that reproduces the scenario I encountered:
https://github.com/jelmerk/flink-cancel-restart-job-without-checkpointing

Am I misunderstanding anything, or should I file a bug for this?
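For reference, the setup that triggers this is roughly the following: a minimal sketch of a Kafka-sourced job with checkpointing deliberately not enabled, so that offsets are only periodically auto-committed back to Kafka. This is not the actual job from the linked repository; the topic name, group id, and broker address are placeholders.

```java
import java.util.Properties;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class NoCheckpointJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        // Note: env.enableCheckpointing(...) is deliberately NOT called,
        // so there is no checkpointed state to restore from.

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "my-group");            // placeholder

        FlinkKafkaConsumer011<String> consumer =
            new FlinkKafkaConsumer011<>(
                "my-topic", new SimpleStringSchema(), props);
        // Per the connector docs, on a plain start this reads from the
        // committed group offsets in Kafka. The surprise described above is
        // that after a restore from a savepoint, a later failure/restart
        // appears to rewind to the savepoint's offsets instead.
        consumer.setStartFromGroupOffsets();

        env.addSource(consumer).print();
        env.execute("job-without-checkpointing");
    }
}
```

The repository linked above contains the actual reproduction; this sketch only illustrates the configuration being discussed (no checkpointing, offsets committed to Kafka).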