apex-dev mailing list archives

From Sandesh Hegde <sand...@datatorrent.com>
Subject Re: [Proposal] Named Checkpoints
Date Wed, 10 Aug 2016 14:58:36 GMT
Thanks for the review, Tushar. Here are my answers:

1. Just storing the mapping from operator id to checkpoint is not enough,
   because "State" = libs + checkpoints (libs can change in the future, see
   https://issues.apache.org/jira/browse/APEXCORE-232).

2. Valid point about users using their own Storage Agents. There is no easy
   solution for that:
   a. The StorageAgent interface has no method to copy checkpoints.
   b. ApexCLI doesn't load the StorageAgent.
   c. A savepoint can be taken for a killed app, so we can't rely on a
      running app to make the copy.
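
To illustrate point 1, here is a minimal sketch of what a savepoint manifest
could record if "State" is libs plus checkpoints. The class name, the methods,
and the line-based text format are all hypothetical; none of them are taken
from the prototype or from Apex itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical savepoint manifest: NOT an Apex class; names and format are assumptions.
class SavepointManifest {
    // Jars the application was launched with (the "libs" half of the state).
    private final List<String> libs = new ArrayList<>();
    // Operator id -> window id of the checkpoint to restore (the "checkpoints" half).
    private final Map<Integer, Long> checkpoints = new TreeMap<>();

    void addLib(String jarName) { libs.add(jarName); }
    void addCheckpoint(int operatorId, long windowId) { checkpoints.put(operatorId, windowId); }
    List<String> libs() { return libs; }
    Map<Integer, Long> checkpoints() { return checkpoints; }

    // One "key=value" line per entry, e.g. "lib=app.jar" or "checkpoint=1:42".
    String serialize() {
        StringBuilder sb = new StringBuilder();
        for (String lib : libs) {
            sb.append("lib=").append(lib).append('\n');
        }
        for (Map.Entry<Integer, Long> e : checkpoints.entrySet()) {
            sb.append("checkpoint=").append(e.getKey()).append(':').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Inverse of serialize(); rejects lines it does not understand.
    static SavepointManifest parse(String text) {
        SavepointManifest m = new SavepointManifest();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Malformed line: " + line);
            }
            String key = line.substring(0, eq);
            String value = line.substring(eq + 1);
            if (key.equals("lib")) {
                m.addLib(value);
            } else if (key.equals("checkpoint")) {
                int colon = value.indexOf(':');
                if (colon < 0) {
                    throw new IllegalArgumentException("Malformed checkpoint: " + value);
                }
                m.addCheckpoint(Integer.parseInt(value.substring(0, colon)),
                        Long.parseLong(value.substring(colon + 1)));
            } else {
                throw new IllegalArgumentException("Unknown key: " + key);
            }
        }
        return m;
    }

    public static void main(String[] args) {
        SavepointManifest m = new SavepointManifest();
        m.addLib("app.jar");
        m.addCheckpoint(1, 42L);
        m.addCheckpoint(2, 42L);
        // Round-trip through the text form and print it.
        System.out.print(parse(m.serialize()).serialize());
    }
}
```

Keeping the libs next to the operator-to-checkpoint mapping means a relaunch
does not depend on whatever jars happen to be current at that time.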

In v1 we can have:
1. Savepoint from HDFS.
2. Launching from a savepoint; checkpoints could be on different Storage
   Agents (users need to make the copy).
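
For the manual copy in item 2, one option is to read each checkpoint from one
store and write it to the other. The following is only a sketch under
assumptions: CheckpointStore is a stand-in interface with just load and save,
and CheckpointStore, InMemoryStore and CheckpointCopier are hypothetical names,
not Apex APIs. The in-memory map stands in for HDFS or any other backing store.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a storage agent; NOT the Apex StorageAgent interface.
interface CheckpointStore {
    void save(Object state, int operatorId, long windowId);
    Object load(int operatorId, long windowId);
}

// Toy store backed by a map, used here in place of HDFS or a custom agent.
class InMemoryStore implements CheckpointStore {
    private final Map<String, Object> data = new HashMap<>();

    private static String key(int operatorId, long windowId) {
        return operatorId + "/" + windowId;
    }

    @Override
    public void save(Object state, int operatorId, long windowId) {
        data.put(key(operatorId, windowId), state);
    }

    @Override
    public Object load(int operatorId, long windowId) {
        Object state = data.get(key(operatorId, windowId));
        if (state == null) {
            throw new IllegalStateException(
                    "No checkpoint for operator " + operatorId + " at window " + windowId);
        }
        return state;
    }
}

class CheckpointCopier {
    // Copies the listed (operator id -> window id) checkpoints; returns how many were copied.
    static int copy(CheckpointStore from, CheckpointStore to, Map<Integer, Long> checkpoints) {
        int copied = 0;
        for (Map.Entry<Integer, Long> e : checkpoints.entrySet()) {
            Object state = from.load(e.getKey(), e.getValue());
            to.save(state, e.getKey(), e.getValue());
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) {
        InMemoryStore source = new InMemoryStore();
        InMemoryStore target = new InMemoryStore();
        source.save("state-of-operator-1", 1, 42L);

        Map<Integer, Long> wanted = new HashMap<>();
        wanted.put(1, 42L);
        System.out.println("copied " + copy(source, target, wanted) + " checkpoint(s)");
    }
}
```

The copy fails fast if a listed checkpoint is missing, which seems preferable
to launching from a partially copied savepoint.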

What do you think?

Thanks




On Wed, Aug 10, 2016 at 2:17 AM Tushar Gosavi <tushar@datatorrent.com>
wrote:

> The prototype implementation assumes that checkpoints are always stored
> in HDFS, but users could implement their own
> storage agent. In that case this implementation may not work. A more
> useful approach would be to have a metadata file
> for each savepoint which stores operator id and checkpoint id, and to
> prevent the master from purging those checkpoints on commit.
> During restart the storage agent can get the required checkpoints from its
> store, and which checkpoints to load will be available in the
> savepoint metadata file.
>
> - Tushar.
>
>
>
> On Mon, Aug 8, 2016 at 8:49 PM, Sandesh Hegde <sandesh@datatorrent.com>
> wrote:
> > The idea here was to create a recovery/committed window on demand. But
> > there is always one (except before the first) recovery window for the DAG.
> > Instead of using/modifying the Checkpoint tuple, I am planning to reuse
> > the existing recovery window state, which simplifies the implementation.
> >
> > Proposed API:
> >
> > ApexCli> savepoint <appId> <folderToSaveTheState>
> > ApexCli> launch -savepoint <folderWithTheState>
> >
> > First prototype:
> >
> > https://github.com/sandeshh/apex-core/commit/8ec7e837318c2b33289251cda78ece0024a3f895
> >
> > Thanks
> >
> > On Thu, Aug 4, 2016 at 11:54 AM Amol Kekre <amol@datatorrent.com> wrote:
> >
> >> hmm! actually it may be a good debugging tool too. Keep the named
> >> checkpoints around. The feature is to keep checkpoints around, which can be
> >> done by giving a feature to not delete checkpoints, but then naming them
> >> makes it more operational. Send a command from cli -> get checkpoint -> know
> >> it is the one you need as the file name has the string you sent with the
> >> command -> debug. This is different from querying a state as this gives
> >> the entire app checkpoint to debug with.
> >>
> >> Thks
> >> Amol
> >>
> >>
> >> On Thu, Aug 4, 2016 at 11:41 AM, Venkatesh Kottapalli <
> >> venkatesh@datatorrent.com> wrote:
> >>
> >> > + 1 for the idea.
> >> >
> >> > It might be helpful to developers as well when dealing with a variety of
> >> > data in large volumes if this can help them run from the checkpointed state
> >> > rather than rerunning the application altogether in case of issues.
> >> >
> >> > I have seen cases where the application runs for more than 10 hours and
> >> > some partitions fail because of the variety of data that it is dealing
> >> > with. In such cases, the application has to be restarted, and a feature
> >> > of this kind would help developers.
> >> >
> >> > The ease of enabling/disabling this feature to run the app will also be
> >> > important.
> >> >
> >> > -Venkatesh.
> >> >
> >> >
> >> > > On Aug 4, 2016, at 10:29 AM, Amol Kekre <amol@datatorrent.com> wrote:
> >> > >
> >> > > We had a user who wanted roll-back and restart for audit purposes. At that
> >> > > time we did not have timed-window. Named checkpoints would have helped a
> >> > > little bit.
> >> > >
> >> > > Problem statement: Auditors ask for a rerun of yesterday's computations for
> >> > > verification. Assume that these computations depend on previous state
> >> > > (i.e. data from the day before yesterday).
> >> > >
> >> > > Solution
> >> > > 1. Have named checkpoints at 12 in the night (an input adapter triggers it)
> >> > > every day
> >> > > 2. The app spools raw logs into hdfs along with window ids and event times
> >> > > 3. The re-run is a separate app that starts off on a named checkpoint (12
> >> > > night yesterday)
> >> > >
> >> > > Technically the solution will not be as simple and the "new audit app" will
> >> > > need a lot of other checks (dedups, drop events not in yesterday's window,
> >> > > wait for late arrivals, ...), but named checkpoints help.
> >> > >
> >> > > I do agree with Pramod that replay within the same running app is not
> >> > > viable within a data-in-motion architecture. But it helps somewhat in a new
> >> > > audit app. Named checkpoints help data-in-motion architectures handle batch
> >> > > apps better. In the above case #2, spooling done with event time stamp + state
> >> > > suffices. The state part comes from the named checkpoint.
> >> > >
> >> > > Thks,
> >> > > Amol
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Aug 4, 2016 at 10:12 AM, Sanjay Pujare <sanjay@datatorrent.com>
> >> > > wrote:
> >> > >
> >> > >> I agree. A specific use-case will be useful to support this feature. Also
> >> > >> the ability to replay from the named checkpoint will be limited because of
> >> > >> various factors, isn’t it?
> >> > >>
> >> > >> On 8/4/16, 9:00 AM, "Pramod Immaneni" <pramod@datatorrent.com> wrote:
> >> > >>
> >> > >>    There is a problem here, keeping old checkpoints and recovering from them
> >> > >>    means preserving the old input data along with the state. This is more than
> >> > >>    the mechanism of actually creating named checkpoints, it means having the
> >> > >>    ability for operators to move forward (a.k.a committed and dropping
> >> > >>    committed states and buffer data) while still having the ability to replay
> >> > >>    from that point from the input source and providing a way for operators (at
> >> > >>    first look input operators) to distinguish that. Why would someone need
> >> > >>    this with idempotent processing? Is there a specific use case you are
> >> > >>    looking at? Suppose we go do this, for the mechanism, I would be in favor
> >> > >>    of reusing existing tuple.
> >> > >>
> >> > >>    On Thu, Aug 4, 2016 at 8:44 AM, Vlad Rozov <v.rozov@datatorrent.com>
> >> > >>    wrote:
> >> > >>
> >> > >>> +1 for the feature. At first look I am more in favor of reusing existing
> >> > >>> control tuple.
> >> > >>>
> >> > >>> Thank you,
> >> > >>>
> >> > >>> Vlad
> >> > >>>
> >> > >>>
> >> > >>> On 8/4/16 08:17, Sandesh Hegde wrote:
> >> > >>>
> >> > >>>> @Chinmay
> >> > >>>> We can enhance the existing checkpoint tuple but that one is more
> >> > >>>> frequently used than this feature, so why burden Checkpoint tuple with
> >> > >>>> an extra field?
> >> > >>>>
> >> > >>>> @Aniruddha
> >> > >>>> It is better to leave the scheduling to the users, they can use any tool
> >> > >>>> that they are already familiar with.
> >> > >>>>
> >> > >>>> On Thu, Aug 4, 2016 at 7:40 AM Aniruddha Thombare <
> >> > >>>> aniruddha@datatorrent.com>
> >> > >>>> wrote:
> >> > >>>>
> >> > >>>>> +1 On the idea, it would be awesome to have.
> >> > >>>>>
> >> > >>>>> Question: Can we further develop this brilliant idea into:-
> >> > >>>>> Scheduled checkpoints (to save as dynamically named checkpoints)?
> >> > >>>>> This would be on the lines of logrotate / general backup strategies.
> >> > >>>>>
> >> > >>>>>
> >> > >>>>> Thanks,
> >> > >>>>>
> >> > >>>>> A
> >> > >>>>>
> >> > >>>>> _____________________________________
> >> > >>>>> Sent with difficulty, I mean handheld ;)
> >> > >>>>> On 4 Aug 2016 8:03 pm, "Munagala Ramanath" <ram@datatorrent.com> wrote:
> >> > >>>>>
> >> > >>>>> +1
> >> > >>>>>>
> >> > >>>>>> Ram
> >> > >>>>>>
> >> > >>>>>> On Thu, Aug 4, 2016 at 12:10 AM, Sandesh Hegde <sandesh@datatorrent.com>
> >> > >>>>>> wrote:
> >> > >>>>>>
> >> > >>>>>>> Hello Team,
> >> > >>>>>>>
> >> > >>>>>>> This thread is to discuss the Named Checkpoint feature for Apex.
> >> > >>>>>>> (https://issues.apache.org/jira/browse/APEXCORE-498)
> >> > >>>>>>>
> >> > >>>>>>> Named checkpoints allow following workflow,
> >> > >>>>>>>
> >> > >>>>>>> 1. Users can trigger a checkpoint and give it a name
> >> > >>>>>>> 2. Relaunch the application from the named checkpoint.
> >> > >>>>>>> 3. These checkpoints survive the "purge of old checkpoints".
> >> > >>>>>>>
> >> > >>>>>>> Current idea is to add a new control tuple, NamedCheckPointTuple, which
> >> > >>>>>>> contains the user specified name, it traverses the DAG and along the way
> >> > >>>>>>> necessary actions are taken.
> >> > >>>>>>>
> >> > >>>>>>> Please let me know your thoughts on this.
> >> > >>>>>>>
> >> > >>>>>>> Thanks
> >> > >>>>>>>
> >> > >>>>>>>
> >> > >>>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> >
> >> >
> >>
>
