apex-dev mailing list archives

From Sasha Parfenov <sa...@datatorrent.com>
Subject Re: Supporting iterations in Apex
Date Fri, 18 Sep 2015 00:15:06 GMT
Given some Apache projects have already been referring to a similar concept
as iterations, it may make sense to stick with that terminology.  See
https://ci.apache.org/projects/flink/flink-docs-release-0.7/iterations.html

On Thu, Sep 17, 2015 at 4:31 PM, Pramod Immaneni <pramod@datatorrent.com>
wrote:

> I agree DELAY or MEMORY is a well-defined concept in other areas such as
> digital electronics, VLSI, and digital signal processing.
>
> On Thu, Sep 17, 2015 at 3:39 PM, Chetan Narsude <chetan@datatorrent.com>
> wrote:
>
> > Iteration implies that something is being looped over, whereas that's just
> > one use case of this functionality. One can take the output of an upstream
> > operator and give it to the input of the downstream operator.
> >
> > AFAIK, DELAY is a very well understood concept in event processing and
> > analogous to how we intend to use it.
> >
> >
> > On Wed, Sep 16, 2015 at 5:32 PM, David Yan <david@datatorrent.com>
> > wrote:
> >
> > > I think keeping the word ITERATION is clearer to the users because
> > > that's what it is for.
> > > The user wouldn't think he/she is trying to "delay" something...
> > > In any case, I am fine either way :)
> > >
> > > David
> > >
> > > On Wed, Sep 16, 2015 at 5:12 PM, Munagala Ramanath <ram@datatorrent.com>
> > > wrote:
> > >
> > > > I like ITERATION_WINDOW_OFFSET.
> > > >
> > > > Ram
> > > >
> > > > On Wed, Sep 16, 2015 at 4:42 PM, David Yan <david@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Thanks Chetan.
> > > > >
> > > > > Can you point me to the location of the Deduper code that may be
> > > > > helpful with the recovery implementation?
> > > > >
> > > > > Does anyone have any opinion on the renaming of
> > > > > ITERATION_WINDOW_COUNT? DELAY_BY_WINDOW_COUNT? DELAY_WINDOW_COUNT?
> > > > >
> > > > > David
> > > > >
> > > > > On Wed, Sep 16, 2015 at 2:21 PM, Chetan Narsude <chetan@datatorrent.com>
> > > > > wrote:
> > > > >
> > > > > > David,
> > > > > >
> > > > > >  I have 3 comments:
> > > > > >
> > > > > > 1. The "ahead window" phrase you discussed above is really "behind
> > > > > > window". With Apex, the windows which are ahead are the windows with
> > > > > > smaller window IDs. Smaller window IDs are followed by bigger window
> > > > > > IDs.
> > > > > >
> > > > > > 2. ITERATION_WINDOW_COUNT sounds like a misnomer. IMO, it should be
> > > > > > something akin to DELAY_BY_WINDOW_COUNT, as you are delaying the
> > > > > > events by that many windows. You are not iterating over them that
> > > > > > many times. It also resonates with PortContext.SLIDE_BY_WINDOW_COUNT.
> > > > > >
> > > > > > 3. Deduper has a similar requirement, where a large amount of data
> > > > > > (potentially even larger) needs to be partitioned. You can borrow
> > > > > > the idea/code from there, and perhaps abstract the code to be
> > > > > > reusable.
> > > > > >
> > > > > > HTH.
> > > > > >
> > > > > > --
> > > > > > Chetan
> > > > > >
> > > > > > On Wed, Sep 16, 2015 at 1:44 PM, David Yan <david@datatorrent.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > One current disadvantage of Apex is the inability to do iterations
> > > > > > > and machine learning algorithms because we don't allow loops in the
> > > > > > > application DAG (hence the name DAG).  I am proposing that we allow
> > > > > > > loops in the DAG if the loop advances the window ID by a configured
> > > > > > > amount.  A JIRA ticket has been created:
> > > > > > >
> > > > > > > https://malhar.atlassian.net/browse/APEX-60
> > > > > > >
> > > > > > > I have started this work in my fork at
> > > > > > > https://github.com/davidyan74/incubator-apex-core/tree/APEX-60 .
> > > > > > >
> > > > > > > The current progress is that a simple test case works.  Major work
> > > > > > > still needs to be done with respect to recovery and partitioning.
> > > > > > >
> > > > > > > ITERATION_WINDOW_COUNT is an attribute of an input port of an
> > > > > > > operator.  If the value of the attribute is greater than or equal
> > > > > > > to 1, any tuples sent to the input port are treated as being
> > > > > > > ITERATION_WINDOW_COUNT windows ahead of the window they were sent in.
> > > > > > >
> > > > > > > For recovery, we will need to checkpoint all the tuples between
> > > > > > > ports in order to replay the looped tuples.  During recovery, if an
> > > > > > > operator that has an input port with ITERATION_WINDOW_COUNT=2 is
> > > > > > > recovering from checkpoint window 14, the tuples for that input
> > > > > > > port from window 13 and window 14 need to be replayed and treated
> > > > > > > as window 15 and window 16 respectively (13+2 and 14+2).
> > > > > > >
> > > > > > > In other words, we need to store all the tuples from the window
> > > > > > > with ID (committedWindowId minus ITERATION_WINDOW_COUNT) onward for
> > > > > > > recovery, and purge the tuples earlier than that window.
> > > > > > > We can optimize this by only storing the tuples for
> > > > > > > ITERATION_WINDOW_COUNT windows prior to any checkpoint.
> > > > > > >
> > > > > > > For that, we need a storage mechanism for the tuples.  Chandni
> > > > > > > already has something that fits this use case in Apex Malhar.  The
> > > > > > > class is IdempotentStorageManager.  In order for this to be used in
> > > > > > > Apex Core, we need to deprecate the class in Apex Malhar and move
> > > > > > > it to Apex Core.
> > > > > > >
> > > > > > > A JIRA ticket has been created for this particular work:
> > > > > > >
> > > > > > > https://malhar.atlassian.net/browse/APEX-128
> > > > > > >
> > > > > > > Some of the above has been discussed among Thomas, Chetan, Chandni,
> > > > > > > and myself.
> > > > > > >
> > > > > > > For partitioning, we have not started any discussion or
> > > > > > > brainstorming.  We appreciate any feedback on this and any other
> > > > > > > aspect related to supporting iterations in general.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > David
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
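
The recovery arithmetic in David's original proposal (checkpoint window 14, ITERATION_WINDOW_COUNT=2, windows 13 and 14 replayed as 15 and 16) can be sketched roughly as below. This is only an illustration: the class and method names are hypothetical and are not part of Apex, and window IDs are treated as plain consecutive integers, which is an assumption that may not hold for real Apex window IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names come from the Apex code base. */
final class DelayWindowSketch {

    /** Window in which a looped tuple is delivered: emitting window plus the configured count. */
    static long deliveryWindow(long emittedWindow, int windowCount) {
        return emittedWindow + windowCount;
    }

    /** Emitting windows whose tuples must be replayed when recovering from checkpointWindow. */
    static List<Long> windowsToReplay(long checkpointWindow, int windowCount) {
        List<Long> windows = new ArrayList<>();
        for (long w = checkpointWindow - windowCount + 1; w <= checkpointWindow; w++) {
            windows.add(w);
        }
        return windows;
    }

    /** Drops stored tuples for windows strictly older than committedWindowId - windowCount. */
    static <T> void purge(TreeMap<Long, List<T>> stored, long committedWindowId, int windowCount) {
        // headMap is a live view of the map, so clearing it removes the old windows.
        stored.headMap(committedWindowId - windowCount, false).clear();
    }

    public static void main(String[] args) {
        // Reproduces the example from the proposal: checkpoint window 14, count 2.
        for (long w : windowsToReplay(14, 2)) {
            System.out.println("window " + w + " replays as window " + deliveryWindow(w, 2));
        }
    }
}
```

Running `main` prints that window 13 replays as window 15 and window 14 replays as window 16. The `purge` method follows the retention rule as stated in the proposal (keep everything from committedWindowId minus the count onward); the optimization mentioned there, storing only the windows just before a checkpoint, is not modeled.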
