apex-dev mailing list archives

From Thomas Weise <...@apache.org>
Subject Re: [DISCUSS] Proposal for adapting Malhar operators for batch use cases
Date Tue, 17 Jan 2017 16:29:29 GMT
The HDFS source can operate in two modes, bounded or unbounded. If it scans
only once, it should emit the final watermark after it is done.
Otherwise it would emit watermarks based on a policy (file names etc.).
The mechanism to generate the watermarks may depend on the type of source, and
the user needs to be able to influence/configure it.
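
A rough sketch of the two modes, only to make the idea concrete. Every name
here (WatermarkSketch, scanOnce, timestampFromName, FINAL_WATERMARK) is
hypothetical and not part of the Apex or Malhar API; the file-name policy
assumes names of the form "prefix-<timestamp>.ext".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

/** Hypothetical sketch; none of these names exist in Apex or Malhar. */
final class WatermarkSketch {
    /** Assumed marker meaning "no more data will ever arrive". */
    static final long FINAL_WATERMARK = Long.MAX_VALUE;

    /**
     * Performs one scan pass over the given file names and returns the
     * watermarks the source would emit during that pass.
     */
    static List<Long> scanOnce(List<String> fileNames, boolean bounded,
                               ToLongFunction<String> policy) {
        List<Long> emitted = new ArrayList<>();
        for (String name : fileNames) {
            // ... the records of this file would be read and emitted here ...
            if (!bounded) {
                // Unbounded mode: the configurable policy decides the watermark.
                emitted.add(policy.applyAsLong(name));
            }
        }
        if (bounded) {
            // Bounded mode: a single scan, then the final watermark.
            emitted.add(FINAL_WATERMARK);
        }
        return emitted;
    }

    /** Example policy (assumption): "events-1484670569.log" yields 1484670569. */
    static long timestampFromName(String fileName) {
        int dash = fileName.lastIndexOf('-');
        int dot = fileName.lastIndexOf('.');
        return Long.parseLong(fileName.substring(dash + 1, dot));
    }
}
```

In bounded mode the policy is never consulted, so a batch application only
needs to wait for the final watermark. In unbounded mode the caller would
invoke scanOnce repeatedly, and the policy is the piece a user would
configure.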

Thomas


On Tue, Jan 17, 2017 at 5:03 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
wrote:

> Hi Thomas,
>
> I am not sure that I completely understand your suggestion. Are you
> suggesting to broaden the scope of the proposal to treat all sources as
> bounded as well as unbounded?
>
> In the case of Apex, we treat all sources as unbounded sources. Even a
> bounded source like the HDFS file source is treated as unbounded by
> scanning the input directory repeatedly.
>
> Let's consider the HDFS file source for example:
> In this case, if we treat it as a bounded source, we can define hooks which
> allow us to detect the end of the file and send the "final watermark". We
> could also consider the HDFS file source as a streaming source and define
> hooks which send watermarks based on different kinds of windows.
>
> Please correct me if I misunderstand.
>
> ~ Bhupesh
>
>
> On Mon, Jan 16, 2017 at 9:23 PM, Thomas Weise <thw@apache.org> wrote:
>
> > Bhupesh,
> >
> > Please see how that can be solved in a unified way using windows and
> > watermarks. It is bounded data vs. unbounded data. In Beam for example,
> > you can use the "global window" and the final watermark to accomplish what
> > you are looking for. Batch is just a special case of streaming where the
> > source emits the final watermark.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Mon, Jan 16, 2017 at 1:02 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
> > wrote:
> >
> > > Yes, if the user needs to develop a batch application, then batch aware
> > > operators need to be used in the application.
> > > The nature of the application is mostly controlled by the input and the
> > > output operators used in the application.
> > >
> > > For example, consider an application which needs to filter records in an
> > > input file and store the filtered records in another file. The nature of
> > > this app is to end once the entire file is processed. The following things
> > > are expected of the application:
> > >
> > >    1. Once the input data is over, finalize the output file from .tmp
> > >    files. - Responsibility of output operator
> > >    2. End the application, once the data is read and processed -
> > >    Responsibility of input operator
> > >
> > > These functions are essential to allow the user to do higher level
> > > operations like scheduling or running a workflow of batch applications.
> > >
> > > I am not sure about intermediate (processing) operators, as there is no
> > > change in their functionality for batch use cases. Perhaps allowing
> > > multiple batches to run in a single application would require similar
> > > changes in processing operators as well.
> > >
> > > ~ Bhupesh
> > >
> > > On Mon, Jan 16, 2017 at 2:19 PM, Priyanka Gugale <priyag@apache.org>
> > > wrote:
> > >
> > > > Will this give users the impression that, for a batch use case, they
> > > > have to use batch aware operators only? If so, is that what we expect?
> > > > I am not aware of how we implement the batch scenario, so this might be
> > > > a basic question.
> > > >
> > > > -Priyanka
> > > >
> > > > On Mon, Jan 16, 2017 at 12:02 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > While design / implementation for custom control tuples is ongoing, I
> > > > > thought it would be a good idea to consider its usefulness in one of
> > > > > the use cases - batch applications.
> > > > >
> > > > > This is a proposal to adapt / extend existing operators in the Apache
> > > > > Apex Malhar library so that it is easy to use them in batch use cases.
> > > > > Naturally, this would be applicable for only a subset of operators like
> > > > > File, JDBC and NoSQL databases.
> > > > > For example, for a file based store, (say HDFS store), we could have
> > > > > FileBatchInput and FileBatchOutput operators which allow easy
> > > > > integration into a batch application. These operators would be extended
> > > > > from their existing implementations and would be "Batch Aware", in that
> > > > > they may understand the meaning of some specific control tuples that
> > > > > flow through the DAG. Start batch and end batch seem to be the obvious
> > > > > candidates that come to mind. On receipt of such control tuples, they
> > > > > may try to modify the behavior of the operator - to reinitialize some
> > > > > metrics or finalize an output file for example.
> > > > >
> > > > > We can discuss the potential control tuples and actions in detail, but
> > > > > first I would like to understand the views of the community for this
> > > > > proposal.
> > > > >
> > > > > ~ Bhupesh
> > > > >
> > > >
> > >
> >
>
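
For illustration, the start batch / end batch control tuples from the
original proposal quoted above might be handled by a batch aware output
operator roughly as follows. This is an in-memory sketch under assumptions:
BatchAwareOutputSketch, ControlTuple, onData and onControl are hypothetical
names, not Apex or Malhar APIs, and the "pending" list stands in for the
.tmp file that a real output operator would finalize.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not an actual Malhar operator. */
final class BatchAwareOutputSketch {
    /** Assumed control tuples, named after the proposal's start/end batch. */
    enum ControlTuple { START_BATCH, END_BATCH }

    // Records written so far in the current batch (stands in for a .tmp file).
    private final List<String> pending = new ArrayList<>();
    // Batches that have been finalized (stands in for the renamed output files).
    private final List<List<String>> finalized = new ArrayList<>();
    // A per-batch metric that is reinitialized when a batch starts.
    private long recordsInBatch;

    /** Regular data tuple: append to the in-progress output. */
    void onData(String record) {
        pending.add(record);
        recordsInBatch++;
    }

    /** Control tuple: adjust operator behavior at batch boundaries. */
    void onControl(ControlTuple tuple) {
        switch (tuple) {
            case START_BATCH:
                // Reinitialize per-batch state and metrics.
                pending.clear();
                recordsInBatch = 0;
                break;
            case END_BATCH:
                // Finalize the output of this batch.
                finalized.add(new ArrayList<>(pending));
                pending.clear();
                break;
        }
    }

    long recordsInBatch() {
        return recordsInBatch;
    }

    List<List<String>> finalizedBatches() {
        return finalized;
    }
}
```

Keeping the batch boundary logic in onControl leaves the data path
unchanged, which matches the observation in the thread that mainly the
input and output operators need to become batch aware.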
