MIME-Version: 1.0
Date: Wed, 16 Mar 2016 14:50:42 +0100
Subject: Flink Checkpoint on yarn
From: Simone Robutti
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=001a11c3cd2cc0d393052e2acc1f

--001a11c3cd2cc0d393052e2acc1f
Content-Type: text/plain; charset=UTF-8

Hello,

I'm testing the checkpointing functionality with hdfs as a backend.
From what I can see, it uses different checkpoint files and resumes the computation from different points, not from the latest available one. This is unexpected behaviour to me.

I log, every second and for every worker, a counter that is increased by 1 at each step.

So, for example, on node-1 the count goes up to 5, then I kill a job manager or task manager and it resumes from 5 or 4, which is fine. The next time I kill a job manager the count is at 15 and it resumes at 14 or 15. Sometimes, at a third kill, the work resumes at 4 or 5, as if the checkpoint used for the second resume weren't there.

Once I even saw it jump forward: the first kill is at 10 and it resumes at 9, the second kill is at 70 and it resumes at 9, the third kill is at 15 but it resumes at 69, as if it resumed from the checkpoint of the second kill.

This is clearly inconsistent.

Also, in the logs I can see that it sometimes uses a checkpoint file different from the one used in the previous, consistent resume.

What am I doing wrong? Is it a known bug?

--001a11c3cd2cc0d393052e2acc1f
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11c3cd2cc0d393052e2acc1f--