From commits-return-57658-archive-asf-public=cust-asf.ponee.io@beam.apache.org  Thu Feb  1 14:34:08 2018
Return-Path: <commits-return-57658-archive-asf-public=cust-asf.ponee.io@beam.apache.org>
X-Original-To: archive-asf-public@eu.ponee.io
Delivered-To: archive-asf-public@eu.ponee.io
Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183])
	by mx-eu-01.ponee.io (Postfix) with ESMTP id 4B895180652
	for <archive-asf-public@eu.ponee.io>; Thu,  1 Feb 2018 14:34:08 +0100 (CET)
Received: by cust-asf.ponee.io (Postfix)
	id 3BB0D160C26; Thu,  1 Feb 2018 13:34:08 +0000 (UTC)
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by cust-asf.ponee.io (Postfix) with SMTP id 65B75160C44
	for <archive-asf-public@cust-asf.ponee.io>; Thu,  1 Feb 2018 14:34:07 +0100 (CET)
Received: (qmail 42264 invoked by uid 500); 1 Feb 2018 13:34:06 -0000
Mailing-List: contact commits-help@beam.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:commits-help@beam.apache.org>
List-Unsubscribe: <mailto:commits-unsubscribe@beam.apache.org>
List-Post: <mailto:commits@beam.apache.org>
List-Id: <commits.beam.apache.org>
Reply-To: dev@beam.apache.org
Delivered-To: mailing list commits@beam.apache.org
Received: (qmail 42255 invoked by uid 99); 1 Feb 2018 13:34:06 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 01 Feb 2018 13:34:06 +0000
Received: from localhost (localhost [127.0.0.1])
	by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 77682C04AB
	for <commits@beam.apache.org>; Thu,  1 Feb 2018 13:34:05 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: -110.311
X-Spam-Level:
X-Spam-Status: No, score=-110.311 tagged_above=-999 required=6.31
	tests=[ENV_AND_HDR_SPF_MATCH=-0.5, RCVD_IN_DNSWL_MED=-2.3,
	SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01, USER_IN_DEF_SPF_WL=-7.5,
	USER_IN_WHITELIST=-100] autolearn=disabled
Received: from mx1-lw-eu.apache.org ([10.40.0.8])
	by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024)
	with ESMTP id rHTsZ9PxrAJ9 for <commits@beam.apache.org>;
	Thu,  1 Feb 2018 13:34:03 +0000 (UTC)
Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139])
	by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id CB15E5FB76
	for <commits@beam.apache.org>; Thu,  1 Feb 2018 13:34:03 +0000 (UTC)
Received: from jira-lw-us.apache.org (unknown [207.244.88.139])
	by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 58AF4E01BE
	for <commits@beam.apache.org>; Thu,  1 Feb 2018 13:34:02 +0000 (UTC)
Received: from jira-lw-us.apache.org (localhost [127.0.0.1])
	by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 27B0221E85
	for <commits@beam.apache.org>; Thu,  1 Feb 2018 13:34:00 +0000 (UTC)
Date: Thu, 1 Feb 2018 13:34:00 +0000 (UTC)
From: "Pawel Bartoszek (JIRA)" <jira@apache.org>
To: commits@beam.apache.org
Message-ID: <JIRA.13118215.1510657953000.97722.1517492040161@Atlassian.JIRA>
In-Reply-To: <JIRA.13118215.1510657953000@Atlassian.JIRA>
References: <JIRA.13118215.1510657953000@Atlassian.JIRA> <JIRA.13118215.1510657953016@jira-lw-us.apache.org>
Subject: [jira] [Updated] (BEAM-3186) In-flight data loss when restoring
 from savepoint
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


     [ https://issues.apache.org/jira/browse/BEAM-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pawel Bartoszek updated BEAM-3186:
----------------------------------
    Description: 
*The context:*

I want to count how many events of given type(A,B, etc) I receive every minute using 1 minute windows and AfterWatermark trigger with allowed lateness 1 min.

*Data loss case*
In the case below if there is at least one A element with the event time belonging to the window 14:00-14:01 read from Kinesis stream after job is restored from savepoint the data loss will not be observed for this key and this window.
!restore_no_trigger.png!

*Not data loss case*
However, if no new A element element is read from Kinesis stream than data loss is observable.
!restore_with_trigger.png!


*Workaround*
As a workaround we could configure early firings every X seconds which gives up to X seconds data loss per key on restore.

*My guess where the issue might be*

I believe this is Beam-Flink integration layer bug. From my investigation I don't think it's KinesisReader and possibility that it couldn't advance watermark. To prove that after I restore from savepoint I sent some records for different key (B) for the same window as shown in the pictures(14:00-14:01) without seeing trigger going off for restored window and key A.

My guess is that Beam after job is restored doesn't register flink event time timer for restored window unless there is a new element (key) coming for the restored window.

Please refer to [this gist|https://gist.github.com/pbartoszek/7ab88c8b6538039db1b383358d1d1b5a] for test job that shows this behaviour.


  was:
*The context:*

I want to count how many events of given type(A,B, etc) I receive every minute using 1 minute windows and AfterWatermark trigger with allowed lateness 1 min.

*Data loss case*
In the case below if there is at least one A element with the event time belonging to the window 14:00-14:01 read from Kinesis stream after job is restored from savepoint the data loss will not be observed for this key and this window.
!restore_with_trigger.png!

*Not data loss case*
However, if no new A element element is read from Kinesis stream than data loss is observable.

!restore_no_trigger.png!

*Workaround*
As a workaround we could configure early firings every X seconds which gives up to X seconds data loss per key on restore.

*My guess where the issue might be*

I believe this is Beam-Flink integration layer bug. From my investigation I don't think it's KinesisReader and possibility that it couldn't advance watermark. To prove that after I restore from savepoint I sent some records for different key (B) for the same window as shown in the pictures(14:00-14:01) without seeing trigger going off for restored window and key A.

My guess is that Beam after job is restored doesn't register flink event time timer for restored window unless there is a new element (key) coming for the restored window.

Please refer to [this gist|https://gist.github.com/pbartoszek/7ab88c8b6538039db1b383358d1d1b5a] for test job that shows this behaviour.


> In-flight data loss when restoring from savepoint
> -------------------------------------------------
>
>                 Key: BEAM-3186
>                 URL: https://issues.apache.org/jira/browse/BEAM-3186
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Pawel Bartoszek
>            Assignee: Dawid Wysakowicz
>            Priority: Major
>         Attachments: restore_no_trigger.png, restore_with_trigger.png, restore_with_trigger_b.png
>
>
> *The context:*
> I want to count how many events of given type(A,B, etc) I receive every minute using 1 minute windows and AfterWatermark trigger with allowed lateness 1 min.
> *Data loss case*
> In the case below if there is at least one A element with the event time belonging to the window 14:00-14:01 read from Kinesis stream after job is restored from savepoint the data loss will not be observed for this key and this window.
> !restore_no_trigger.png!
> *Not data loss case*
> However, if no new A element element is read from Kinesis stream than data loss is observable.
> !restore_with_trigger.png!
> *Workaround*
> As a workaround we could configure early firings every X seconds which gives up to X seconds data loss per key on restore.
> *My guess where the issue might be*
> I believe this is Beam-Flink integration layer bug. From my investigation I don't think it's KinesisReader and possibility that it couldn't advance watermark. To prove that after I restore from savepoint I sent some records for different key (B) for the same window as shown in the pictures(14:00-14:01) without seeing trigger going off for restored window and key A.
> My guess is that Beam after job is restored doesn't register flink event time timer for restored window unless there is a new element (key) coming for the restored window.
> Please refer to [this gist|https://gist.github.com/pbartoszek/7ab88c8b6538039db1b383358d1d1b5a] for test job that shows this behaviour.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)