apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sandeep Deshmukh <sand...@datatorrent.com>
Subject Re: Enhance batch support - batch demarcation
Date Mon, 28 Dec 2015 15:01:22 GMT
+1 for batch support in Apex. I would be interested to be part of this work.

I would like to start with basics and would like to know how one will
define "batch" in Apex context. Which of the following cases would be
supported under batch:

   1. A program completes a task and auto shutdown itself once the task is
   complete. E.g.  the program needs to copy a set of files from source to
   destination.
   2. A program completes a task and then waits for pre-defined time to
   poll for something more to work on. E.g. the program copies all the files
   from source location and then periodically checks, say every 1 hour, if
   there are new files at the source and copies them.
   3. A program completes a task and then polls every 1 hr as in case 2 but
   releases resources during wait time.

Needs for each of the above will vary. I am putting down some basic
requirements for each of them

1. This case will need a mechanism to shutdown automatically on completion
of the task.

StartProgram()
    StartBatch()
        Streaming Application starts, runs and finishes
    EndBatch()
EndProgram()

2. This will simply need a construct to wait for some time ( say 10
minutes) or till some time ( till 1pm) .

StartProgram()
while(true)
{
    StartBatch()
        Streaming Application starts, runs and finishes
    EndBatch()
    WaitTill(time) or WaitFor(timeperiod)
}
EndProgram()

3. Apart from wait construct, we also need release resources support

StartProgram()
while(true)
{
    RestartFromSavedState() // if any state is saved previously.
    StartBatch()
        Streaming Application starts, runs and finishes
    EndBatch()
    SaveState()
    RelaseResources()
    WaitTill(time) or WaitFor(timeperiod)
}
EndProgram()


All the constructs : waitTime(), RestartFromSavedState(), SaveState()
, RelaseResources()
could be very well be part of StartBatch() or EndBatch(). I have put them
separately for clear understanding only.

Another point to think on would be scheduler. A batch job is generally
triggered as a cron job. Do we still see Apex jobs being triggered by cron
or would like to include a scheduler within Apex that will trigger jobs
based on time or on some external trigger or even polling for events.

Regards
Sandeep

On Mon, Dec 28, 2015 at 5:11 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
wrote:

> +1
>
> I think in the batch case, application windows may be transparent to the
> user application / operator logic.  A batch can be thought of as one
> instantiation of a Apex Dag, from setup() to teardown() for all operators.
> May be we need to define a higher level API which encapsulates a streaming
> application.
> Something like:
>
> StartBatch()
>   Streaming Application starts, runs and finishes
> EndBatch()
>
> The streaming application will run transparently with all the windowing /
> checkpointing logic that it currently does. Checkpointing large amounts of
> data may be avoided by either checkpointing at large intervals or even
> disabling checkpointing for the batch job.
> Additionally, the external trigger (existence of some file etc. ) can be
> controlled by the StartBatch() and EndBatch() calls. In all the batch use
> cases, it is usually the case that once the input is processed completely,
> the batch is done. Example: In map reduce all splits processed means batch
> job is done. Similar primitives can be supported by Apex in order to
> facilitate the control management in the StartBatch() and EndBatch()
> methods.
>
> -Bhupesh
>
> On Mon, Dec 28, 2015 at 1:34 PM, Thomas Weise <thomas@datatorrent.com>
> wrote:
>
> > Following JIRA is open to enhance the support for batch:
> >
> > https://issues.apache.org/jira/browse/APEXCORE-235
> >
> > One of the challenges with batch on Apex today is that there isn't any
> > native support to identify begin/end of batch and associate actions to
> it.
> > For example, at the beginning we may want to fetch some data needed for
> all
> > subsequent processing and at the end perform some finalization action or
> > push to external system (add partition to Hive table or similar).
> >
> > Absent native support, the workaround is to add a bunch of ports and
> extra
> > operators for propagation and synchronization purposes, which makes
> > building the batch application with standard operators or development of
> > custom operators rather difficult and inefficient.
> >
> > The span of a batch can also be seen as a user defined window, with logic
> > for begin and end. The current "application window" support is limited
> to a
> > multiple of streaming window on a per operator basis. In the batch case,
> > the boundary needs to be more flexible - user code needs to be able to
> > determine begin/endWindow based on external data (existence of files
> etc.).
> >
> > There is another commonality with application window, and that's
> alignment
> > of checkpointing. For batches where it is more efficient to redo the
> > processing instead of checkpointing potentially large amounts of
> > intermediate state for incremental recovery, it would be nice to be able
> to
> > say "user window == checkpoint interval".
> >
> > This is to float the idea of having a window control that can be
> influenced
> > by user code. An operator that identifies the batch boundary tells the
> > engine about it and corresponding control tuples are submitted through
> the
> > stream, leading to callbacks on downstream operators. These control
> > tuples should
> > be able to carry contextual information that can be used in downstream
> > operator logic (file names, schema information etc.)
> >
> > I don't expect the current beginWindow/endWindow can be augmented in a
> > backward compatible way to accommodate this, but a similar optional
> > interface could be supported to enable batch aware operators and
> > checkpointing optimization.
> >
> > Thoughts?
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message