apex-dev mailing list archives

From Gaurav Gupta <gau...@datatorrent.com>
Subject Re: Including meta data with input tuples
Date Thu, 21 Jan 2016 21:00:13 GMT
I have raised a JIRA to track this issue:
https://issues.apache.org/jira/browse/APEXMALHAR-1981

Thanks
-Gaurav

On Wed, Nov 18, 2015 at 7:05 AM, Amol Kekre <amol@datatorrent.com> wrote:

> That makes sense. But then this should not be ON by default, as the per-tuple
> cost is high. Meta data will also help with the ask from Ilya for the ability
> to add latency as per-tuple meta-data.
>
> Thks,
> Amol
>
>
> On Wed, Nov 18, 2015 at 1:03 AM, Sandeep Deshmukh <sandeep@datatorrent.com>
> wrote:
>
> >    1. Potentially each tuple can have different meta-data and hence sending
> >    meta-data and data tuples separately is not a good idea. Example could be
> >    tuple incoming time which will vary for each tuple. In such a case, data
> >    and meta-data should be *tightly coupled*.
> >    2. In case of separate meta-data tuple mechanism, schema will be
> >    different for data tuple and meta-data tuple, which will make things messy.
> >    3. Partitioning will pose a problem as data & meta-data tuples need to
> >    be passed on to the same partition
> >
> >
> > I would vote for a mechanism to bundle meta-data in the tuple, and for the
> > schema to worry only about the data.
> >
> > Regards,
> > Sandeep
> >
> > On Wed, Nov 18, 2015 at 11:14 AM, Gaurav Gupta <gaurav@datatorrent.com>
> > wrote:
> >
> > > Yes, in the worst case we’ll have meta data followed by data for every tuple.
> > >
> > > The data schema will only have an id / reference to the meta data instead
> > > of the whole meta data.
> > >
> > > Thanks
> > > - Gaurav
> > >
> > > > On Nov 17, 2015, at 9:39 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
> > > > wrote:
> > > >
> > > > > Ok, so in the worst case, we'll have meta data followed by data for every
> > > > > tuple.
> > > > > However, in this case we need to include the meta data as part of the data
> > > > > schema itself so as to allow the parser to process data and meta data in a
> > > > > common way. This is similar to option 1 in the first email.
> > > >
> > > >
> > > > Thanks.
> > > > Bhupesh
> > > >
> > > > > On Wed, Nov 18, 2015 at 11:02 AM, Gaurav Gupta <gaurav@datatorrent.com>
> > > > > wrote:
> > > >
> > > >> Bhupesh,
> > > >>
> > > > >> No it doesn’t stall anything… Meta data and data tuples go on same port.
> > > > >> Whenever there is a change in meta data, send the meta data first and then
> > > > >> tuples following it. So the first tuple that arrives which has different
> > > > >> meta data, will trigger sending of new meta data.
> > > >>
> > > >> Thanks
> > > >> - Gaurav
> > > >>
> > > > >>> On Nov 17, 2015, at 9:28 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
> > > > >>> wrote:
> > > >>>
> > > > >>> Depends on how "real time" the scenario is.
> > > > >>> I think sending it only once during a window might work for some use cases.
> > > > >>> If my understanding is correct, this essentially stalls the processing of a
> > > > >>> window until the meta data is available which is not until end window of
> > > > >>> the upstream operator.
> > > >>>
> > > >>> Thanks
> > > >>> -Bhupesh
> > > >>>
> > > >>>
> > > > >>> On Wed, Nov 18, 2015 at 10:54 AM, Gaurav Gupta <gaurav@datatorrent.com>
> > > > >>> wrote:
> > > >>>
> > > >>>> Bhupesh,
> > > >>>>
> > > > >>>> If the requirement is to send meta data with every tuple then it should be
> > > > >>>> sent with the data schema itself.
> > > > >>>> Can sending meta data be optimized the way the platform does with
> > > > >>>> DefaultStatefulStreamCodec? I mean send the meta data only once in a
> > > > >>>> window, and all the tuples that are associated with this meta data carry
> > > > >>>> this meta data’s id.
> > > >>>>
> > > >>>> Thanks
> > > >>>> - Gaurav
> > > >>>>
> > > > >>>>> On Nov 17, 2015, at 8:20 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
> > > > >>>>> wrote:
> > > >>>>>
> > > >>>>> Hi All,
> > > >>>>>
> > > > >>>>> In the design of input modules, we are facing situations where we might
> > > > >>>>> need to pass on some meta data to the downstream modules, in addition to
> > > > >>>>> actual data. Further, this meta data may need to be sent per record. An
> > > > >>>>> example use case is to send a record and additionally send the file name
> > > > >>>>> (as meta data) from which the record was read. Another example is sending
> > > > >>>>> out the kafka topic information along with the message.
> > > >>>>>
> > > >>>>> We are exploring options on:
> > > >>>>>
> > > > >>>>> 1. Whether to include the meta information in the data schema, so as to
> > > > >>>>> allow the parser to handle this data as regular data. This will involve
> > > > >>>>> changing the schema of the data.
> > > > >>>>> 2. Whether to handle meta data separately and modify the behaviour of
> > > > >>>>> parser / converter to handle meta data separately as well.
> > > > >>>>> 3. Use additional ports to transfer such meta data depending on
> > > > >>>>> different modules.
> > > > >>>>> 4. Any other option
> > > >>>>>
> > > >>>>> Please comment.
> > > >>>>>
> > > >>>>> Consolidating comments on another thread here:
> > > >>>>>
> > > > >>>>> 1. Have the tuple containing two parts, with the downstream parser
> > > > >>>>> ignoring the meta data
> > > > >>>>>    1. Data
> > > > >>>>>    2. Meta-data
> > > > >>>>> 2. Use option 1, but concern regarding how unifiers will treat meta
> > > > >>>>> data, if they need to unify that as well.
> > > > >>>>> 3. Another comment is to have a centralized meta data repo. This may be
> > > > >>>>> in memory as well, may be as a separate operator which stores and serves
> > > > >>>>> the meta data to other operators.
> > > >>>>>
> > > >>>>> Thanks.
> > > >>>>>
> > > >>>>> -Bhupesh
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > >
> > >
> >
>
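
For readers of the archive, the "send meta data only when it changes, and let
data tuples carry its id" idea discussed in the quoted thread can be sketched
roughly as below. This is only an illustration under assumptions: every name
here (MetaTaggingEmitter, MetaMessage, DataMessage, emit) is hypothetical, the
output port is modelled as a plain list, and nothing in it is Apex or Malhar
API or the design tracked in the JIRA.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Apex or Malhar.
final class MetaTaggingEmitter {
    /** Anything sent on the single output port: meta data or data. */
    interface Message {}

    /** Sent the first time a meta data value is seen in a window. */
    static final class MetaMessage implements Message {
        final int id;
        final String meta;
        MetaMessage(int id, String meta) { this.id = id; this.meta = meta; }
    }

    /** A data tuple that carries only the id of its meta data. */
    static final class DataMessage implements Message {
        final int metaId;
        final String record;
        DataMessage(int metaId, String record) { this.metaId = metaId; this.record = record; }
    }

    // Ids are only valid within the current window (an assumption of this sketch).
    private final Map<String, Integer> idsThisWindow = new HashMap<>();
    // Stand-in for the operator's output port.
    private final List<Message> port = new ArrayList<>();

    /** Forget the ids handed out in the previous window. */
    void beginWindow() {
        idsThisWindow.clear();
    }

    /** Emit meta data first if it is new in this window, then the data tuple. */
    void emit(String record, String meta) {
        Integer id = idsThisWindow.get(meta);
        if (id == null) {
            id = idsThisWindow.size();
            idsThisWindow.put(meta, id);
            port.add(new MetaMessage(id, meta));
        }
        port.add(new DataMessage(id, record));
    }

    List<Message> emitted() {
        return port;
    }

    public static void main(String[] args) {
        MetaTaggingEmitter emitter = new MetaTaggingEmitter();
        emitter.beginWindow();
        emitter.emit("line 1", "a.txt");
        emitter.emit("line 2", "a.txt");
        emitter.emit("line 3", "b.txt");
        System.out.println(emitter.emitted().size() + " messages emitted");
    }
}
```

When the meta data differs for every tuple (for example an arrival time), this
degenerates to one meta data message per data tuple, which is the worst case
acknowledged in the thread. The sketch also does not address the partitioning
concern raised there: meta data and the data tuples that reference it would
still need to reach the same partition.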
