Subject: Re: @RequireTimeSortedInput design draft
To: dev@beam.apache.org
From: Jan Lukavský
Date: Thu, 6 Jun 2019 17:03:24 +0200

Hi,

I have written a PoC implementation of this in [1] and I'd like to discuss some implementation details. First of all, I'd appreciate any feedback about this.

There are some known issues:

 1) I need to figure out how to get the Coder of the input PCollection of a stateful ParDo inside StatefulDoFnRunner

 2) there are performance considerations that can probably be solved only by Sorted Map State [2]

 3) additional work is needed for allowedLateness to work correctly (there are at least two ways to solve this), see the design doc [3]

 4) more tests (for batch and validatesRunner) are needed

I have come across a few bugs in DirectRunner, which I tried to solve:

 a) timers seem to be broken in stateful ParDo with side inputs

 b) timers need to be sorted by timestamp, otherwise state might be cleared before it gets a chance to be flushed

Thanks for feedback,

 Jan

[1] https://github.com/apache/beam/pull/8774

[2] http://mail-archives.apache.org/mod_mbox/beam-dev/201905.mbox/%3cCALsTK6+LdEmTjmnUYSn3vCufywjkhMgv1iSFBdMXTHoqH91xTQ@mail.gmail.com%3e

[3] https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/

On 5/23/19 4:40 PM, Robert Bradshaw wrote:
> Thanks for writing this up.
>
> I think the justification for adding this to the model needs to be
> that it is useful (you have this covered, though some examples would
> be nice) and that it's something that can't easily be done by users
> themselves (specifically, though it can be (relatively) cheaply done
> in streaming and batch, it's done in very different ways, and also
> that it's hard to do via composition).
>
> On Thu, May 23, 2019 at 4:10 PM Jan Lukavský wrote:
>> Hi,
>>
>> I have written a very brief draft of how it might be possible to
>> implement @RequireTimeSortedInput discussed in [1]. I see the document
>> [2] as a starting point for a discussion. There are several open questions,
>> which I believe can be resolved by this great community. :-)
>>
>> Jan
>>
>> [1] http://mail-archives.apache.org/mod_mbox/beam-dev/201905.mbox/browser
>>
>> [2]
>> https://docs.google.com/document/d/1ObLVUFsf1NcG8ZuIZE4aVy2RYKx2FfyMhkZYWPnI9-c/
>>
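
P.S. To illustrate the general idea behind time-sorted input (and why ordering by timestamp matters in point b), here is a minimal sketch in plain Python. It is not the code from the PoC and does not use any Beam API: the class `TimeSortedBuffer` and its methods `add` and `advance_watermark` are hypothetical names invented for this example. The assumption is that elements are buffered in state and released in timestamp order once the watermark passes them, and that lateness is simplified to "timestamp strictly behind the watermark".

```python
import heapq
from typing import Any, List, Optional, Tuple


class TimeSortedBuffer:
    """Hypothetical helper: buffers elements, releases them in timestamp order."""

    def __init__(self) -> None:
        # Heap entries are (timestamp, sequence number, value).
        self._heap: List[Tuple[int, int, Any]] = []
        # Sequence number keeps insertion order for equal timestamps
        # and avoids comparing the values themselves.
        self._seq = 0
        self._watermark: Optional[int] = None

    def add(self, timestamp: int, value: Any) -> bool:
        """Buffer an element; return False if it is behind the watermark."""
        if self._watermark is not None and timestamp < self._watermark:
            # Simplification: late data is rejected. Real handling would
            # depend on allowedLateness, which this sketch does not model.
            return False
        heapq.heappush(self._heap, (timestamp, self._seq, value))
        self._seq += 1
        return True

    def advance_watermark(self, watermark: int) -> List[Tuple[int, Any]]:
        """Release all buffered elements with timestamp <= watermark, sorted."""
        if self._watermark is None or watermark > self._watermark:
            self._watermark = watermark  # the watermark never moves backwards
        ready: List[Tuple[int, Any]] = []
        while self._heap and self._heap[0][0] <= self._watermark:
            timestamp, _, value = heapq.heappop(self._heap)
            ready.append((timestamp, value))
        return ready
```

For example, after adding elements with timestamps 5, 1 and 3, advancing the watermark to 3 releases the elements at 1 and 3 in that order, and the element at 5 stays buffered until the watermark reaches it. A heap is used here only for brevity; a sorted map state as mentioned in [2] would presumably play the same role inside a runner.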