apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ashwin Chandra Putta <ashwinchand...@gmail.com>
Subject Re: operator recovery window
Date Tue, 15 Dec 2015 23:09:27 GMT
Well, that is true in any case irrespective of recovery at committed window
or further checkpointed window; that we start reprocessing at a window
after recovered checkpointed window, as the checkpoint state is after the
end window of the window that is checkpointed and before the next window
begins.

Here is a summary for future reference.

At operator failure,

1. The operator recovers at its largest checkpointed window that is less
than or equal to the largest common checkpointed window of all the
downstream operators.
2. For output operators, they always recover at their largest checkpointed
window (as they have no dependency on downstream operators)
3. The operator will recover at committed window when there is no
checkpoint for the operator or for any of the downstream operators after
the committed window.

Regards,
Ashwin.

On Tue, Dec 15, 2015 at 2:35 PM, Timothy Farkas <tim@datatorrent.com> wrote:

> Yes and no.
>
> Window 30 could be committed and we could restore to the corresponding
> checkpoint. The corresponding checkpoint for window 30 is taken after
> window 30 is done, but before window 31 has begun. So when we restore to
> the checkpoint we will not redo window 30, we will start at window 31. Once
> a window id is committed it will never be redone by any operator.
>
> On Tue, Dec 15, 2015 at 2:04 PM, Ashwin Chandra Putta <
> ashwinchandrap@gmail.com> wrote:
>
> > Siyuan,
> >
> > Yes, we are discussing at least once semantics.
> >
> > Tim,
> >
> > So it is indeed possible that we recover at committed window id in a case
> > where we just committed and there were no further checkpoints before
> > failure.
> >
> > Regards,
> > Ashwin.
> >
> > On Tue, Dec 15, 2015 at 1:54 PM, Timothy Farkas <tim@datatorrent.com>
> > wrote:
> >
> > > Whoops my bad, that would never happen. There is a check that only
> allows
> > > purging of checkpoints for an operator if the operator has more than
> one
> > > checkpoint. :)
> > >
> > > On Tue, Dec 15, 2015 at 1:39 PM, Timothy Farkas <tim@datatorrent.com>
> > > wrote:
> > >
> > > > Siyuan, then Ashwin may be right that there is an issue. Looking at
> the
> > > > code again I think this could happen:
> > > >
> > > > 1 - All operators reach checkpiont 30
> > > > 2 - Checkpoints are updated on heartbeat and committed window is now
> > 25,
> > > > everything before window 30 is purged
> > > > 3 - no new checkpoint is reached for any operator
> > > > 4 - Checkpoints are updated on heartbeat again and committed window
> is
> > > now
> > > > 30, now window 30 is purged.
> > > >
> > > > May be missing something again though.
> > > >
> > > > On Tue, Dec 15, 2015 at 1:32 PM, Siyuan Hua <siyuan@datatorrent.com>
> > > > wrote:
> > > >
> > > >> My understanding is the committed window could possibly be 30 as
> well,
> > > >> depends on whether container manager get heart beat from containers.
> > > >>
> > > >> And I guess the discussion is assuming at_least_once semantic? :)
> > > >> at_most_once should have different recovery window.
> > > >>
> > > >> On Tue, Dec 15, 2015 at 12:01 PM, Timothy Farkas <
> tim@datatorrent.com
> > >
> > > >> wrote:
> > > >>
> > > >> > Hi Ashwin,
> > > >> >
> > > >> > In your example, if A fails the recovery windows would be
> > > >> >
> > > >> > D - 15
> > > >> > C - 15
> > > >> > B - 15
> > > >> > A - 15
> > > >> >
> > > >> > If C fails the recovery windows would be
> > > >> >
> > > >> > D -15
> > > >> > C -15
> > > >> > B - 25
> > > >> > A - 30
> > > >> >
> > > >> > If every operator just reached window 30 and checkpointed, the
> > > committed
> > > >> > window would be 25, and all the checkpoints before window 30
would
> > be
> > > >> > purged, but the checkpoint for window 30 would not be purged.
> > > >> >
> > > >> > Thanks,
> > > >> > Tim
> > > >> >
> > > >> > On Tue, Dec 15, 2015 at 11:41 AM, Ashwin Chandra Putta <
> > > >> > ashwinchandrap@gmail.com> wrote:
> > > >> >
> > > >> > > Tim,
> > > >> > >
> > > >> > > Thanks, that is pretty much inline with what I was thinking.
A
> > > little
> > > >> > > different thought though in terms of picking the checkpoint
> based
> > on
> > > >> > > downstream operators. For A, is it not going to be "the
> checkpoint
> > > >> with
> > > >> > the
> > > >> > > largest window id that is less than or equal to the checkpoint
> > with
> > > >> the
> > > >> > > largest common window id (instead of largest window id)
among
> all
> > > the
> > > >> > > operators down stream to A"
> > > >> > >
> > > >> > > For example,
> > > >> > >
> > > >> > > If A -> B -> C -> D is the dag. And say, the checkpoint
window
> > count
> > > >> is 5
> > > >> > > and the largest checkpoints are as follows.
> > > >> > >
> > > >> > > A - 30
> > > >> > > B - 25
> > > >> > > C - 20
> > > >> > > D - 15
> > > >> > >
> > > >> > > Does A recover at 25 (checkpoint with largest window id)
or 15
> > > >> > (checkpoint
> > > >> > > with largest common window id)?
> > > >> > >
> > > >> > > Also, regarding recovering at committed window id. Is it
not
> > > possible
> > > >> in
> > > >> > > the following scenario where all operators have checkpointed
at
> 30
> > > and
> > > >> > got
> > > >> > > the committed window call back. And then an operator fails
> before
> > > any
> > > >> > > operator checkpoints further. In that case, the recovery
window
> is
> > > 30
> > > >> > > right?
> > > >> > >
> > > >> > > Regards,
> > > >> > > Ashwin.
> > > >> > >
> > > >> > > On Mon, Dec 14, 2015 at 11:58 PM, Timothy Farkas <
> > > tim@datatorrent.com
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hi Ashwin,
> > > >> > > >
> > > >> > > > The recovery checkpoint for operator A is computed
by taking
> the
> > > >> > > checkpoint
> > > >> > > > with the largest window id that is less than or equal
to the
> > > >> checkpoint
> > > >> > > > with the largest window id among all the operators
down stream
> > to
> > > A.
> > > >> > The
> > > >> > > > output operators in a dag will always recover to their
most
> > recent
> > > >> > > > checkpoint. The input operator of the dag may recover
to the
> > > >> earliest
> > > >> > > > checkpoint. Operators between the input and ouput operators
> > could
> > > >> > recover
> > > >> > > > to a window in between.
> > > >> > > >
> > > >> > > > I don't think you can ever recover to a committed window,
the
> > > >> earliest
> > > >> > I
> > > >> > > > think you can recover to is the window after the committed
> > window
> > > >> (may
> > > >> > be
> > > >> > > > wrong on this).
> > > >> > > >
> > > >> > > > On Mon, Dec 14, 2015 at 11:05 PM, Ashwin Chandra Putta
<
> > > >> > > > ashwinchandrap@gmail.com> wrote:
> > > >> > > >
> > > >> > > > > In the apex architecture there is concept of checkpointing
> and
> > > >> > concept
> > > >> > > of
> > > >> > > > > committed when all operator have crossed a common
> checkpoint.
> > > >> > > > >
> > > >> > > > > So, in which scenarios does a given operator recover
at last
> > > >> > checkpoint
> > > >> > > > > window vs last committed window vs some other
checkpoint
> > window
> > > in
> > > >> > > > between?
> > > >> > > > > --
> > > >> > > > >
> > > >> > > > > Regards,
> > > >> > > > > Ashwin.
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > --
> > > >> > >
> > > >> > > Regards,
> > > >> > > Ashwin.
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> >
> > Regards,
> > Ashwin.
> >
>



-- 

Regards,
Ashwin.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message