Date: Tue, 1 Mar 2016 09:35:20 -0800
Subject: Windows, watermarks, and late data
From: Michael Radford
To: user@flink.apache.org

I'm evaluating Flink for a reporting application that will keep various aggregates updated in a database.
It will be consuming from Kafka queues that are replicated from remote data centers, so in case there is a long outage in replication, I need to decide what to do about windowing and late data.

If I use Flink's built-in windows and watermarks, any late data will come in 1-element windows, which could overwhelm the database if a large batch of late data arrives and each element is mapped to an individual database update.

As far as I can tell, I have two options:

1. Ignore late data, by marking it as late in an AssignerWithPunctuatedWatermarks function, and then discarding it in a flatMap operator. In this scenario, I would rely on a batch process to fill in the missing data later, in the lambda architecture style.

2. Implement my own watermark logic to allow full windows of late data. It seems like I could, for example, emit a "tick" message that is replicated to all partitions every n messages, and then a custom Trigger could decide when to purge each window based on the ticks and a timeout duration. The system would never emit a real Watermark.

My questions are:

- Am I mistaken about either of these, or are there any other options I'm not seeing for avoiding 1-element windows?

- For option 2, are there any problems with not emitting actual watermarks, as long as the windows are eventually purged by a trigger?

Thanks,
Mike
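
To make option 2 concrete, here is a rough sketch of the tick-and-timeout idea in plain Java, with no Flink dependency. It is only an illustration of the logic a custom Trigger might follow, not Flink API: the class name TickWindowSketch, the methods onEvent and onTick, and the rule "close a window once every partition has ticked past window end plus timeout" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: tumbling windows closed by per-partition ticks instead of watermarks. */
class TickWindowSketch {
    private final long windowSizeMs;
    private final long timeoutMs;
    private final long[] lastTickMs; // latest tick time seen per partition
    // windowStart -> {sum, count}; sorted so the oldest windows are visited first
    private final TreeMap<Long, long[]> open = new TreeMap<>();

    TickWindowSketch(long windowSizeMs, long timeoutMs, int numPartitions) {
        this.windowSizeMs = windowSizeMs;
        this.timeoutMs = timeoutMs;
        this.lastTickMs = new long[numPartitions];
        Arrays.fill(lastTickMs, Long.MIN_VALUE); // no partition has ticked yet
    }

    /** Adds an event to its window; a late event simply joins the window if it is still open. */
    void onEvent(long timestampMs, long value) {
        long start = timestampMs - Math.floorMod(timestampMs, windowSizeMs);
        long[] agg = open.computeIfAbsent(start, k -> new long[2]);
        agg[0] += value;
        agg[1] += 1;
    }

    /** Records a tick and returns {windowStart, sum, count} for each window that can now be closed. */
    List<long[]> onTick(int partition, long tickTimeMs) {
        lastTickMs[partition] = Math.max(lastTickMs[partition], tickTimeMs);
        long minTick = Arrays.stream(lastTickMs).min().getAsLong();
        List<long[]> closed = new ArrayList<>();
        if (minTick == Long.MIN_VALUE) {
            return closed; // at least one partition has not ticked yet
        }
        Iterator<Map.Entry<Long, long[]>> it = open.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, long[]> e = it.next();
            long windowEnd = e.getKey() + windowSizeMs;
            if (windowEnd + timeoutMs > minTick) {
                break; // map is sorted, so all later windows are still open too
            }
            closed.add(new long[] {e.getKey(), e.getValue()[0], e.getValue()[1]});
            it.remove();
        }
        return closed;
    }
}
```

In this sketch, each closed window would map to a single database update. An event for a window that has already been closed re-opens it, and the next tick emits whatever has accumulated, so very late data is batched per tick interval rather than per element.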