From: Sandesh Hegde
Date: Wed, 01 Mar 2017 20:29:18 +0000
Subject: Re: APEXCORE-619 Recovery windowId in future during application relaunch.
To: dev@apex.apache.org

1. Create an empty checkpoint file for the stateless operators.
2. Remove the logic to treat stateless operators as a special case.

The rest of the design remains as is.

On Wed, Mar 1, 2017 at 11:18 AM Amol Kekre wrote:

> The third option should be it.
> 1. On relaunch the DAG should start at committedWindowId
> 2. Pruning of checkpoints should only happen after committedWindowId is
>    written by Stram state
>
> Thanks,
> Amol
>
> On Wed, Mar 1, 2017 at 5:34 AM, Tushar Gosavi wrote:
>
> > Help Needed for APEXCORE-619
> >
> > Issue: When an application is relaunched after a long time with
> > stateless operators at the end of the DAG, the stateless operators
> > start with a very high windowId. In this case the stateless operator
> > ignores all the data it receives until the upstream operator catches
> > up with it. This breaks the *at-least-once* guarantee on relaunch of
> > the operator, or when the master is killed and the application is
> > restarted.
> >
> > Solutions:
> > - Fix the windowId for stateless leaf operators from the upstream
> >   operator. But this has some issues when we have a join with two
> >   upstream operators at different windowIds. If we set the windowId to
> >   min(upstream windowId), then we need to recalculate the new recovery
> >   window ids for the upstream paths from this operator.
> >
> > - Another solution is to create an empty file in the checkpoint
> >   directory for stateless operators. This will help us identify the
> >   checkpoints of stateless operators during relaunch instead of
> >   computing them from the latest timestamp.
> >
> > - Bring the entire DAG to committedWindowId. This could be achieved by
> >   writing committedWindowId in a journal. We need to make sure that we
> >   are not purging the checkpointed state until the committedWindowId
> >   is saved in the journal.
> >
> > Let me know your thoughts on this and your preferred solution.
> >
> > Regards,
> > -Tushar.
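
To make the proposal at the top of this message concrete, here is a rough sketch of the idea combined with the pruning rule from the quoted reply. It is a simplified Python model, not Apex code: the in-memory `CheckpointStore`, the function names, and the operator names "reader" and "console" are all hypothetical, and it assumes an operator's recovery window is simply its newest checkpoint at or before committedWindowId.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for the checkpoint directory:
# operator name -> {window id -> serialized state}.
CheckpointStore = Dict[str, Dict[int, bytes]]


def save_checkpoint(store: CheckpointStore, operator: str, window_id: int,
                    state: bytes = b"") -> None:
    """Record a checkpoint; a stateless operator writes an empty payload."""
    store.setdefault(operator, {})[window_id] = state


def recovery_window(store: CheckpointStore, operator: str,
                    committed_window_id: int) -> Optional[int]:
    """Newest checkpoint at or before the committed window, or None."""
    eligible = [w for w in store.get(operator, {}) if w <= committed_window_id]
    return max(eligible) if eligible else None


def prune(store: CheckpointStore, journaled_committed_window_id: int) -> None:
    """Drop checkpoints older than each operator's recovery window.

    Assumption: the caller invokes this only after the committed window id
    has been durably written to the journal.
    """
    for operator, checkpoints in store.items():
        keep_from = recovery_window(store, operator,
                                    journaled_committed_window_id)
        if keep_from is None:
            continue  # nothing at or before the committed window; keep all
        for window_id in [w for w in checkpoints if w < keep_from]:
            del checkpoints[window_id]


store: CheckpointStore = {}
for window_id in (10, 20, 30):
    save_checkpoint(store, "reader", window_id, b"offset")  # stateful
    save_checkpoint(store, "console", window_id)            # stateless: empty

prune(store, journaled_committed_window_id=20)
windows = {op: recovery_window(store, op, 20) for op in store}
print(windows)
```

Because the stateless operator now has checkpoint entries of its own, its recovery window is looked up the same way as for a stateful operator, so no special case (and no timestamp-derived windowId) is needed in this model.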