apex-dev mailing list archives

From Chinmay Kolhatkar <chin...@datatorrent.com>
Subject Re: Schema Discovery Support in Apex Applications
Date Tue, 31 Jan 2017 06:44:09 GMT
The consumer of the output port operator schema is going to be the next downstream operator.
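
Given that, option 2 in the quoted message below (sending the schema over the stream right before the first data record) might look roughly like the following. This is only a sketch to make the idea concrete: `InBandSchemaSketch`, `CsvSchema`, `emit` and `consume` are hypothetical names, the stream is modelled as a plain in-memory queue, and nothing here is part of the Apex or Malhar API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.regex.Pattern;

public class InBandSchemaSketch {

    /** Hypothetical metadata: field names plus one extra property (the CSV delimiter). */
    static final class CsvSchema {
        final List<String> fieldNames;
        final String delimiter;

        CsvSchema(List<String> fieldNames, String delimiter) {
            this.fieldNames = fieldNames;
            this.delimiter = delimiter;
        }
    }

    /** Upstream side: puts the schema on the stream once, then the raw records. */
    static Queue<Object> emit(CsvSchema schema, List<String> lines) {
        Queue<Object> stream = new ArrayDeque<>();
        stream.add(schema);
        stream.addAll(lines);
        return stream;
    }

    /** Downstream side: requires the schema first and uses it to split every record. */
    static List<List<String>> consume(Queue<Object> stream) {
        Object first = stream.poll();
        if (!(first instanceof CsvSchema)) {
            throw new IllegalStateException("schema must precede the first data record");
        }
        CsvSchema schema = (CsvSchema) first;
        List<List<String>> rows = new ArrayList<>();
        for (Object message : stream) {
            // Quote the delimiter so characters such as '|' are not treated as regex syntax.
            String[] values = ((String) message).split(Pattern.quote(schema.delimiter), -1);
            if (values.length != schema.fieldNames.size()) {
                throw new IllegalStateException("record does not match schema: " + message);
            }
            rows.add(Arrays.asList(values));
        }
        return rows;
    }

    public static void main(String[] args) {
        CsvSchema schema = new CsvSchema(Arrays.asList("id", "name"), "|");
        Queue<Object> stream = emit(schema, Arrays.asList("1|alice", "2|bob"));
        System.out.println(consume(stream));
    }
}
```

The receiver never needs the delimiter configured separately, which is the point of carrying such properties in the schema alongside field names and types.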


On Tue, Jan 31, 2017 at 4:01 AM, Sergey Golovko <sergey@datatorrent.com>
wrote:

> Sorry, I'm new to the Apex team, and I don't clearly understand who the
> consumers of the output port operator schema(s) are.
>
> 1. If the consumers are non-run-time callers like the application manager
> or UI designer, maybe it makes sense to use Java static method(s) to
> retrieve the output port operator schema(s). I guess the performance of a
> single call of a static method via reflection can be ignored.
>
> 2. If the consumer is the next downstream operator, maybe it makes sense to
> send the output port operator schema from the upstream operator to the next
> downstream operator via the stream. The corresponding methods that would
> send and receive the schema should be declared in the
> interface/abstract class of the upstream and downstream operators. The
> schema should be sent and received right before the first data record is
> sent via the stream.
>
> A typical example of sending metadata along with a regular result set is
> JDBC, which delivers metadata as part of the JDBC result set. I hope the
> output schema (the metadata of the streamed data) will contain not only the
> signature of the streamed objects (such as field names and data types) but
> also any other properties of the data that the schema receiver can use to
> process it (for instance, the delimiter of a CSV record stream).
>
> Thanks,
> Sergey
>
> On 2017-01-25 01:47 (-0800), Chinmay Kolhatkar <chinmay@datatorrent.com>
> wrote:
> > Thank you all for the feedback.
> >
> > I've created a Jira for this: APEXCORE-623 and I'll attach the same
> > document and link to this mailchain there.
> >
> > As a first part of this Jira, there are 2 steps I would like to propose:
> > 1. Add the following interface at com.datatorrent.common.util.SchemaAware.
> >
> > interface SchemaAware {
> >
> >   Map<OutputPort, Schema> registerSchema(Map<InputPort, Schema> inputSchema);
> > }
> >
> > This interface can be implemented by operators to communicate their output
> > schema(s) to the engine.
> > The input to this method will be the schema(s) at the operator's input ports.
> >
> > 2. After the LogicalPlan is created, call the SchemaAware method on each
> > operator in the DAG, from upstream to downstream, to propagate the schema.
> >
> > Once this is done, changes can be made in Malhar for the operators in
> > question.
> >
> > Please share your opinion on this approach.
> >
> > Thanks,
> > Chinmay.
> >
> >
> >
> >
> > On Wed, Jan 18, 2017 at 2:31 PM, Priyanka Gugale <priyag@apache.org>
> > wrote:
> >
> > > +1 to have this feature.
> > >
> > > -Priyanka
> > >
> > > On Tue, Jan 17, 2017 at 9:18 PM, Pramod Immaneni <pramod@datatorrent.com>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > On Mon, Jan 16, 2017 at 1:23 AM, Chinmay Kolhatkar <chinmay@apache.org>
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Currently, if a user-generated DAG contains any POJOfied operators,
> > > > > the TUPLE_CLASS attribute needs to be set on every port that
> > > > > receives or sends a POJO.
> > > > >
> > > > > For example, if a DAG is File -> Parser -> Transform -> Dedup ->
> > > > > Formatter -> Kafka, then the TUPLE_CLASS attribute needs to be set by
> > > > > the user on both the input and output ports of the Transform and Dedup
> > > > > operators, and also on the Parser output and the Formatter input.
> > > > >
> > > > > The proposal here is to reduce the work required of the user to
> > > > > configure the DAG. Technically speaking, if an operator knows its input
> > > > > schema and processing properties, it can determine its output schema
> > > > > and convey it to downstream operators. This way the complete pipeline
> > > > > can be configured without the user setting TUPLE_CLASS or even creating
> > > > > POJOs and adding them to the classpath.
> > > > >
> > > > > Building on the same idea, I want to propose an approach where the
> > > > > pipeline can be configured without the user setting TUPLE_CLASS or even
> > > > > creating POJOs and adding them to the classpath.
> > > > > Here is the document that explains the idea and a high-level design:
> > > > > https://docs.google.com/document/d/1ibLQ1KYCLTeufG7dLoHyN_tRQXEM3LR-7o_S0z_porQ/edit?usp=sharing
> > > > >
> > > > > I would like to get the community's opinion on the feasibility and
> > > > > applications of this proposal.
> > > > > Once we reach some consensus, we can discuss the design in detail.
> > > > >
> > > > > Thanks,
> > > > > Chinmay.
> > > > >
> > > >
> > >
> >
>
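
For readers following the thread, the two steps proposed in the 2017-01-25 message (operators implement `SchemaAware`, and the engine calls it from upstream to downstream after the LogicalPlan is built) could be sketched as below. Everything apart from the `SchemaAware`/`registerSchema` shape is an assumption made for illustration: ports are plain strings instead of Apex `InputPort`/`OutputPort` objects, `Schema` is a hypothetical field-name-to-type map, the DAG is a simple linear chain, and `Parser`, `Transform` and `propagate` are made-up stand-ins rather than Malhar operators or engine code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SchemaPropagationSketch {

    /** Hypothetical schema: ordered field name -> Java type. */
    static final class Schema {
        final Map<String, Class<?>> fields;

        Schema(Map<String, Class<?>> fields) {
            this.fields = Collections.unmodifiableMap(new LinkedHashMap<>(fields));
        }
    }

    /** Same shape as the proposed interface; ports are plain strings in this sketch. */
    interface SchemaAware {
        Map<String, Schema> registerSchema(Map<String, Schema> inputSchema);
    }

    /** Source-like operator: ignores its (empty) input and declares fixed fields. */
    static final class Parser implements SchemaAware {
        @Override
        public Map<String, Schema> registerSchema(Map<String, Schema> inputSchema) {
            Map<String, Class<?>> fields = new LinkedHashMap<>();
            fields.put("id", Long.class);
            fields.put("amount", Double.class);
            return Collections.singletonMap("out", new Schema(fields));
        }
    }

    /** Derives its output schema by appending one field to whatever arrives on "in". */
    static final class Transform implements SchemaAware {
        @Override
        public Map<String, Schema> registerSchema(Map<String, Schema> inputSchema) {
            Map<String, Class<?>> fields = new LinkedHashMap<>(inputSchema.get("in").fields);
            fields.put("amountWithTax", Double.class);
            return Collections.singletonMap("out", new Schema(fields));
        }
    }

    /** Walks a linear chain upstream to downstream, wiring each "out" to the next "in". */
    static Schema propagate(List<SchemaAware> chain) {
        Map<String, Schema> input = Collections.emptyMap();
        Schema last = null;
        for (SchemaAware operator : chain) {
            last = operator.registerSchema(input).get("out");
            input = Collections.singletonMap("in", last);
        }
        return last;
    }

    public static void main(String[] args) {
        Schema result = propagate(Arrays.<SchemaAware>asList(new Parser(), new Transform()));
        System.out.println(result.fields.keySet()); // prints [id, amount, amountWithTax]
    }
}
```

In this sketch the user configures no per-port tuple class at all; the schema seen by the last operator is derived entirely from what the upstream operators declare.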
