Date: Thu, 17 Sep 2015 15:45:32 -0700
Subject: Re: Supporting iterations in Apex
From: Chandni Singh
To: dev@apex.incubator.apache.org

+1 for ITERATION_WINDOW_OFFSET. We are trying to enable APEX for Iterative
Machine Learning and this sounds clearer/less confusing to me.

On Thu, Sep 17, 2015 at 3:39 PM, Chetan Narsude wrote:

> Iteration implies that something is looping over. Whereas that's just one
> use case of this functionality. One can take the output of an upstream
> operator and give it to input of the downstream operator.
>
> AFAIK, DELAY is very well understood concept in event processing and
> analogous to how we intend to use it.
>
> On Wed, Sep 16, 2015 at 5:32 PM, David Yan wrote:
>
> > I think keeping the word ITERATION is clearer to the users because that's
> > what it is for.
> > The user wouldn't think he/she is trying to "delay" something...
> > In any case, I am fine either way :)
> >
> > David
> >
> > On Wed, Sep 16, 2015 at 5:12 PM, Munagala Ramanath wrote:
> >
> > > I like ITERATION_WINDOW_OFFSET.
> > >
> > > Ram
> > >
> > > On Wed, Sep 16, 2015 at 4:42 PM, David Yan wrote:
> > >
> > > > Thanks Chetan.
> > > >
> > > > Can you point me to the location of the Deduper code that may be
> > > > helpful with the recovery implementation?
> > > >
> > > > Does anyone have any opinion on the renaming of ITERATION_WINDOW_COUNT?
> > > > DELAY_BY_WINDOW_COUNT? DELAY_WINDOW_COUNT?
> > > >
> > > > David
> > > >
> > > > On Wed, Sep 16, 2015 at 2:21 PM, Chetan Narsude <chetan@datatorrent.com>
> > > > wrote:
> > > >
> > > > > David,
> > > > >
> > > > > I have 3 comments:
> > > > >
> > > > > 1. The "ahead window" phrase you discussed above is really "behind
> > > > > window". With Apex, the windows which are ahead are the windows with
> > > > > smaller window IDs. Smaller window IDs are followed by bigger window
> > > > > IDs.
> > > > >
> > > > > 2. ITERATION_WINDOW_COUNT sounds like a misnomer. IMO, it should be
> > > > > something akin to DELAY_BY_WINDOW_COUNT, as you are delaying the events
> > > > > by that many windows. You are not iterating over them that many times.
> > > > > It also resonates with PortContext.SLIDE_BY_WINDOW_COUNT.
> > > > >
> > > > > 3. Deduper has a similar requirement, where a large amount of data
> > > > > (potentially even larger) needs to be partitioned. You can borrow the
> > > > > idea/code from there. And perhaps abstract the code to be reusable.
> > > > >
> > > > > HTH.
> > > > >
> > > > > --
> > > > > Chetan
> > > > >
> > > > > On Wed, Sep 16, 2015 at 1:44 PM, David Yan wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > One current disadvantage of Apex is the inability to do iterations
> > > > > > and machine learning algorithms because we don't allow loops in the
> > > > > > application DAG (hence the name DAG). I am proposing that we allow
> > > > > > loops in the DAG if the loop advances the window ID by a configured
> > > > > > amount. A JIRA ticket has been created:
> > > > > >
> > > > > > https://malhar.atlassian.net/browse/APEX-60
> > > > > >
> > > > > > I have started this work in my fork at
> > > > > > https://github.com/davidyan74/incubator-apex-core/tree/APEX-60.
> > > > > >
> > > > > > The current progress is that a simple test case works. Major work
> > > > > > still needs to be done with respect to recovery and partitioning.
> > > > > >
> > > > > > The value ITERATION_WINDOW_COUNT is an attribute of an input port of
> > > > > > an operator. If the value of the attribute is greater than or equal
> > > > > > to 1, any tuples sent to the input port are treated as being
> > > > > > ITERATION_WINDOW_COUNT windows ahead of what they are.
> > > > > >
> > > > > > For recovery, we will need to checkpoint all the tuples between ports
> > > > > > with the to replay the looped tuples. During the recovery, if an
> > > > > > operator that has an input port with ITERATION_WINDOW_COUNT=2 is
> > > > > > recovering from checkpoint window 14, the tuples for that input port
> > > > > > from window 13 and window 14 need to be replayed and treated as
> > > > > > window 15 and window 16 respectively (13+2 and 14+2).
> > > > > >
> > > > > > In other words, we need to store all the tuples from window with ID
> > > > > > committedWindowId minus ITERATION_WINDOW_COUNT for recovery and purge
> > > > > > the tuples earlier than that window.
> > > > > > We can optimize this by only storing the tuples for
> > > > > > ITERATION_WINDOW_COUNT windows prior to any checkpoint.
> > > > > >
> > > > > > For that, we need a storage mechanism for the tuples. Chandni already
> > > > > > has something that fits this usage case in Apex Malhar. The class is
> > > > > > IdempotentStorageManager. In order for this to be used in Apex core,
> > > > > > we need to deprecate the class in Apex Malhar and move it to Apex
> > > > > > Core.
> > > > > >
> > > > > > A JIRA ticket has been created for this particular work:
> > > > > >
> > > > > > https://malhar.atlassian.net/browse/APEX-128
> > > > > >
> > > > > > Some of the above has been discussed among Thomas, Chetan, Chandni,
> > > > > > and myself.
> > > > > >
> > > > > > For partitioning, we have not started any discussion or
> > > > > > brainstorming. We appreciate any feedback on this and any other
> > > > > > aspect related to supporting iterations in general.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > David
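
Illustrative sketch (not part of the original thread): the recovery arithmetic in David's first message can be written out in a few lines. This is only a rough model under stated assumptions. It treats window IDs as plain consecutive integers, as the message's own example does, and the function names (replayed_window_id, replay_plan, purge_old_windows) are hypothetical and do not come from Apex. The purge rule follows the wording of the message (keep tuples from window committedWindowId minus the offset onward). It does not use or model IdempotentStorageManager.

```python
from typing import Dict, List


def replayed_window_id(source_window_id: int, window_offset: int) -> int:
    """Window ID under which a looped tuple is seen by the input port.

    Hypothetical helper: models the proposed port attribute as a plain
    integer offset added to the window in which the tuple was emitted.
    """
    if window_offset < 1:
        raise ValueError("window offset must be >= 1 on a loop port")
    return source_window_id + window_offset


def replay_plan(checkpoint_window_id: int, window_offset: int) -> Dict[int, int]:
    """Map each source window to replay onto the window it is treated as.

    When recovering from a checkpoint, the last `window_offset` source
    windows up to and including the checkpoint window are replayed.
    """
    first = checkpoint_window_id - window_offset + 1
    return {
        w: replayed_window_id(w, window_offset)
        for w in range(first, checkpoint_window_id + 1)
    }


def purge_old_windows(stored: Dict[int, List[object]],
                      committed_window_id: int,
                      window_offset: int) -> Dict[int, List[object]]:
    """Keep tuples from window (committed - offset) onward; drop earlier ones.

    Assumption: `stored` maps a window ID to the tuples captured in it.
    """
    oldest_kept = committed_window_id - window_offset
    return {w: tuples for w, tuples in stored.items() if w >= oldest_kept}


if __name__ == "__main__":
    # The example from the message: offset 2, recovering from checkpoint 14.
    print(replay_plan(14, 2))
```

With an offset of 2 and checkpoint window 14, replay_plan maps windows 13 and 14 to windows 15 and 16, which matches the 13+2 and 14+2 example in the message.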