Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 4C4FA200BD4 for ; Fri, 16 Dec 2016 16:33:13 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 4AC4A160B24; Fri, 16 Dec 2016 15:33:13 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 6C407160B10 for ; Fri, 16 Dec 2016 16:33:12 +0100 (CET) Received: (qmail 29473 invoked by uid 500); 16 Dec 2016 15:33:09 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@flink.apache.org Delivered-To: mailing list user@flink.apache.org Received: (qmail 29420 invoked by uid 99); 16 Dec 2016 15:33:09 -0000 Received: from mail-relay.apache.org (HELO mail-relay.apache.org) (140.211.11.15) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 16 Dec 2016 15:33:08 +0000 Received: from mail-io0-f171.google.com (mail-io0-f171.google.com [209.85.223.171]) by mail-relay.apache.org (ASF Mail Server at mail-relay.apache.org) with ESMTPSA id AB5851A028B for ; Fri, 16 Dec 2016 15:33:08 +0000 (UTC) Received: by mail-io0-f171.google.com with SMTP id 136so100055395iou.3 for ; Fri, 16 Dec 2016 07:33:08 -0800 (PST) X-Gm-Message-State: AIkVDXJGyrd3yhQmvKEA54zRJMdeggzCaHc1tzVapCqLlo6U4LfhAAAnDDkg9ca9Hou37FvXBkPc/4+1nUYMkQ== X-Received: by 10.107.170.231 with SMTP id g100mr3340789ioj.32.1481902388150; Fri, 16 Dec 2016 07:33:08 -0800 (PST) MIME-Version: 1.0 Received: by 10.107.48.18 with HTTP; Fri, 16 Dec 2016 07:33:07 -0800 (PST) In-Reply-To: <3430EDC0-BF77-4A7E-8EE6-C86C131D0B68@bol.com> References: <3430EDC0-BF77-4A7E-8EE6-C86C131D0B68@bol.com> From: Stephan Ewen Date: Fri, 16 Dec 2016 16:33:07 +0100 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: Question about expired checkpoints To: user@flink.apache.org Content-Type: multipart/alternative; boundary=001a1142e6b2700b6f0543c8491c archived-at: Fri, 16 Dec 2016 15:33:13 -0000 --001a1142e6b2700b6f0543c8491c Content-Type: text/plain; charset=UTF-8 Hi Nick! In general, checkpoints cannot overtake each other. It can happen (in the presence of failure/recovery) that a checkpoint is "half complete" and subsumed by a newer complete checkpoint. The message "Checkpoint 17 expired before completing" might be correct - you could check the start time of the checkpoint and timeout setting to validate that. In general, the "notifyCheckpointComplete" method is not guaranteed to be called for every checkpoint (it can be skipped in presence of failure/recovery). Stephan On Fri, Dec 16, 2016 at 4:24 PM, Nick Tinnemeier wrote: > Hi all, > > > > I am currently playing around with checkpoints to better understand how > they work. I have some questions I hope you can answer. > > I am running a simple topology with a source, a map and a sink that writes > the events it receives to a HBase table. The parallelism of the environment > is set to 10. Moreover, we set the parallelism of the checkpoints to 20. > The source is a custom one, implementing CheckpointListener interface. In > the notifyCheckpointComplete method of the source we simply log the > checkpoint id of each checkpoint that is being notified. > > > > When I run the application, I notice that sometimes not all checkpoints > are notified. Indeed, when I look into the logs of the application I see > messages like: > > "Checkpoint 17 expired before completing" > > > > I am sure the checkpoint did not timeout. What exactly does this mean and > when does it happen? It makes me wonder if this means that checkpoints can > actually overtake each other. In other words, can it actually happen that a > checkpoint x arrives sooner at the sink than a checkpoint x-1 that was sent > earlier than x? > > > > Kind regards, > > Nick. > > > --001a1142e6b2700b6f0543c8491c Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hi Nick!

In general, checkpoints cannot= overtake each other. It can happen (in the presence of failure/recovery) t= hat a checkpoint is "half complete" and subsumed by a newer compl= ete checkpoint.
The message "Checkpoint 17 expired before complet= ing" might be correct - you could check the start time of the checkpoi= nt and timeout setting to validate that.

In general, the &quo= t;notifyCheckpointComplete" method is not guaranteed to be called for = every checkpoint (it can be skipped in presence of failure/recovery).
=
Stephan


On Fri, Dec 16, 2016 at 4:24 PM, Nick Tinne= meier <ntinnemeier@bol.com> wrote:

Hi all,

=C2=A0

I am currently play= ing around with checkpoints to better understand how they work. I have some= questions I hope you can answer.

I am running a simp= le topology with a source, a map and a sink that writes the events it recei= ves to a HBase table. The parallelism of the environment is set to 10. More= over, we set the parallelism of the checkpoints to 20. The source is a custom one, implementing CheckpointList= ener interface. In the notifyCheckpointComplete method of the source we sim= ply log the checkpoint id of each checkpoint that is being notified.=

=C2=A0

When I run the appl= ication, I notice that sometimes not all checkpoints are notified. Indeed, = when I look into the logs of the application I see messages like:=

"Checkpoint 17= expired before completing"

=C2=A0

I am sure the check= point did not timeout. What exactly does this mean and when does it happen?= It makes me wonder if this means that checkpoints can actually overtake ea= ch other. In other words, can it actually happen that a checkpoint x arrives sooner at the sink than a checkpoint x-= 1 that was sent earlier than x?

=C2=A0

Kind regards,

Nick.=

=C2=A0


--001a1142e6b2700b6f0543c8491c--