flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kathleen Sharp <kathleen.sh...@signavio.com>
Subject Re: A question about iterations and prioritizing "control" over "data" inputs
Date Wed, 15 Mar 2017 16:58:43 GMT

I have a similar sounding use case and just yesterday was
experimenting with this approach:

Use 2 separate streams: one for model events, one for data events.
Connect these 2, key the resulting stream and then use a
RichCoFlatMapFunction to ensure that each data event is enriched with
the latest model event as soon as a new model event arrives.
Also, as soon as a new model arrives emit all previously seen events
with this new model events.
This involves keeping events and models in state.
My emitted enriched events have a command-like syntax (add/remove) so
that downstream operators can remove/add as necessary depending on the
calculations (so for each model change I would emit an add/remove pair
of enriched events).

As I say I have only experimented with this yesterday, perhaps someone
a bit more experienced with flink might spot some problems with this
approach, which I would definitely be interested in hearing.


On Wed, Mar 15, 2017 at 2:20 PM, Theodore Vasiloudis
<theodoros.vasiloudis@gmail.com> wrote:
> Hello all,
> I've started thinking about online learning in Flink and one of the issues
> that has come
> up in other frameworks is the ability to prioritize "control" over "data"
> events in iterations.
> To set an example, say we develop an ML model, that ingests events in
> parallel, performs
> an aggregation to update the model, and then broadcasts the updated model to
> back through
> an iteration/back edge. Using the above nomenclature the events being
> ingested would be
> "data" events, and the model update would a "control" event.
> I talked about this scenario a bit with couple of people (Paris and
> Gianmarco) and one thing
> we would like to have is the ability to prioritize the ingestion of control
> events over the data events.
> If my understanding is correct, currently there is a buffer/queue of events
> waiting to be processed
> for each operator, and each incoming event ends up at the end of that queue.
> If our data source is fast, and the model updates slow, a lot of data events
> might be buffered/scheduled
> to be processed before each model update, because of the speed difference
> between the two
> streams. But we would like to update the model that is used to process data
> events as soon as
> the newest version becomes available.
> Is it somehow possible to make the control events "jump" the queue and be
> processed as soon
> as they arrive over the data events?
> Regards,
> Theodore
> P.S. This is still very much a theoretical problem, I haven't looked at how
> such a pipeline would
> be implemented in Flink.

View raw message