flink-user mailing list archives

From Anton Polyakov <polyakov.an...@gmail.com>
Subject Re: Watermarks as "process completion" flags
Date Mon, 30 Nov 2015 10:03:44 GMT
I think I can turn my problem into a simpler one.

Effectively, what I need is a way to checkpoint certain events in the input
stream and, once this checkpoint reaches the end of the DAG, take some action.
So I need a signal at the sink which says "all events in the source before
the checkpointed event are now processed".

As far as I understand, flagged records don't quite work, since the DAG
doesn't propagate source events one-to-one. Some transformations might create
3 child events out of 1 source event. If I want to make sure I have fully
processed a source event, I need to wait until all its children are processed.
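
One way to picture the fan-out problem is a rough sketch in plain Scala (no Flink involved). It assumes each child event carries the id of its source event and that the fan-out stage also emits a count of the children it created. All names here (`Child`, `Done`, `State`, `fanOut`) are hypothetical and only illustrate the bookkeeping a sink would need.

```scala
// Sketch only: plain Scala, no Flink. All names are hypothetical.
object CompletionTracker {
  sealed trait Msg { def parentId: Long }
  // A child event derived from a source event, tagged with its parent's id.
  final case class Child(parentId: Long, payload: String) extends Msg
  // Emitted once per source event: how many children the sink should expect.
  final case class Done(parentId: Long, childCount: Int) extends Msg

  // Fan-out stage: one source event becomes several children plus one Done marker.
  def fanOut(parentId: Long, parts: Seq[String]): Seq[Msg] =
    parts.map(p => Child(parentId, p): Msg) :+ Done(parentId, parts.size)

  // Sink-side state: children seen so far and expected counts, per parent.
  final case class State(seen: Map[Long, Int] = Map.empty,
                         expected: Map[Long, Int] = Map.empty) {
    def update(msg: Msg): State = msg match {
      case Child(id, _) => copy(seen = seen.updated(id, seen.getOrElse(id, 0) + 1))
      case Done(id, n)  => copy(expected = expected.updated(id, n))
    }

    // A parent is complete once its Done marker and all of its children
    // have arrived, in any order.
    def isComplete(parentId: Long): Boolean =
      expected.get(parentId).exists(n => seen.getOrElse(parentId, 0) == n)

    // "Everything up to the checkpointed event is processed" then means
    // that every source id up to that event is complete.
    def allComplete(ids: Iterable[Long]): Boolean = ids.forall(isComplete)
  }
}
```

This only covers a single fan-out stage. With several stages the counts would have to be propagated per stage, which is essentially the barrier mechanism discussed in the quoted thread.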

On Sun, Nov 29, 2015 at 4:12 PM, Anton Polyakov <polyakov.anton@gmail.com> wrote:

> Hi Fabian
> Defining a special flag for a record seems like a checkpoint barrier. I
> think I will end up re-implementing checkpointing myself. I found a
> discussion on flink-dev:
> mail-archives.apache.org/mod_mbox/flink-dev/201511.mbox/…
> <http://mail-archives.apache.org/mod_mbox/flink-dev/201511.mbox/%3CCA+faj9xDFAUG_zi==E2H8s-8R4cn8ZBDON_hf+1Rud5pJqvZ4A@mail.gmail.com%3E>
> which seems to solve my task. Essentially they want a mechanism which
> will mark a record produced by a job as “last” and then wait until it is
> fully propagated through the DAG, similar to what I need. My job, which
> produces trades, can also be thought of as finished once it has produced
> all trades; then I just need to wait until the last trade produced by this
> job is processed.
> So although windows can probably also be applied, I think propagating a
> barrier through the DAG and checkpointing at the final job is what I need.
> Can I possibly utilize Flink’s internal checkpoint barriers (e.g. by
> triggering a custom checkpoint or finishing the streaming job)?
> On 24 Nov 2015, at 21:53, Fabian Hueske <fhueske@gmail.com> wrote:
> Hi Anton,
> If I got your requirements right, you are looking for a solution that
> continuously produces updated partial aggregates in a streaming fashion.
> When a special event (no more trades) is received, you would like to store
> the last update as a final result. Is that correct?
> You can compute continuous updates using a reduce() or fold() function.
> These will produce a new update for each incoming event.
> For example:
> val s: DataStream[(Int, Long)] = ...
> s.keyBy(_._1)
>   .reduce( (x,y) => (x._1, x._2 + y._2) )
> would continuously compute a sum for every key (_._1) and produce an
> update for each incoming record.
> You could add a flag to the record and implement a ReduceFunction that
> marks a record as final when the no-more-trades event is received.
> With a filter and a data sink you could emit such final records to a
> persistent data store.
> Btw.: You can also define custom trigger policies for windows. A custom
> trigger is called for each element that is added to a window and when
> certain timers expire. For example with a custom trigger, you can evaluate
> a window for every second element that is added. You can also define
> whether the elements in the window should be retained or removed after the
> evaluation.
> Best, Fabian
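
Fabian's suggestion above (a continuous reduce, a flag on the final record, then a filter in front of the sink) can be pictured with a small plain-Scala sketch that emulates the behaviour on a finite sequence. It does not use the Flink API, and the names `Record`, `Update`, `updates` and `finals` are made up for illustration. It also assumes the no-more-trades signal arrives as a record for the same key.

```scala
// Sketch only: emulates keyBy + reduce + filter on a plain Seq, no Flink.
object RunningSum {
  // Input: a value for a key; last = true stands for the no-more-trades signal.
  final case class Record(key: Int, value: Long, last: Boolean = false)
  // Output of the emulated reduce: the running sum, flagged when final.
  final case class Update(key: Int, sum: Long, isFinal: Boolean)

  // One update per incoming record, like a continuous reduce per key.
  def updates(in: Seq[Record]): Seq[Update] =
    in.foldLeft((Map.empty[Int, Long], Vector.empty[Update])) {
      case ((sums, out), r) =>
        val s = sums.getOrElse(r.key, 0L) + r.value
        (sums.updated(r.key, s), out :+ Update(r.key, s, r.last))
    }._2

  // The filter in front of the sink: only final updates would be persisted.
  def finals(in: Seq[Record]): Seq[Update] = updates(in).filter(_.isFinal)
}
```

In a real job the signal would have to reach every key and every parallel instance of the reducer, which this sketch does not address.
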
> 2015-11-24 21:32 GMT+01:00 Anton Polyakov <polyakov.anton@gmail.com>:
>> Hi Max
>> thanks for the reply. From what I understand, a window buffers records
>> while it is open, then applies the transformation once the window close is
>> triggered and passes on the transformed result.
>> In my case the window would be open for a few hours, and the whole set of
>> trades would be processed only once the window close is triggered. Actually
>> I want to process events as they are produced, without buffering them. It
>> is more like a stream with a special mark, whereas windowing seems more
>> like a batch (if I understand it correctly).
>> In other words, buffering and waiting for the window to close before
>> processing would be equivalent to simply doing one-off processing once all
>> events have been produced. I am looking for a solution where I process
>> events as they are produced, so that when the source signals "done" my
>> processing is also nearly done.
>> On Tue, Nov 24, 2015 at 2:41 PM, Maximilian Michels <mxm@apache.org>
>> wrote:
>>> Hi Anton,
>>> You should be able to model your problem using the Flink Streaming
>>> API. The actions you want to perform on the streamed records
>>> correspond to transformations on Windows. You can indeed use
>>> Watermarks to signal the window that a threshold for an action has
>>> been reached. Otherwise an eviction policy should also do it.
>>> Without more details about what you want to do I can only refer you to
>>> the streaming API documentation:
>>> Please see
>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html
>>> Thanks,
>>> Max
>>> On Sun, Nov 22, 2015 at 8:53 PM, Anton Polyakov
>>> <polyakov.anton@gmail.com> wrote:
>>> > Hi
>>> >
>>> > I am very new to Flink and in fact have never used it. My task (which I
>>> currently solve using a home-grown Redis-based solution) is quite simple: I
>>> have a system which produces events (trades; it is a financial system)
>>> and a computational chain which computes a measure cumulatively over
>>> these events. The events form a long but finite stream; they are produced
>>> as a result of the end-of-day flow. The computational logic forms a
>>> processing DAG which computes a measure over these events (VaR). Each trade
>>> is processed through the DAG and at different stages might produce a
>>> different set of subsequent events (like return vectors); eventually they
>>> all arrive at an aggregator (reducer) which computes the accumulated measure.
>>> >
>>> > Ideally I would like to process trades as they appear (i.e. stream
>>> them) and, once the producer reaches the end of the portfolio (there will be
>>> no more trades), write the final resulting measure and mark it as the “end
>>> of day record”. Of course I could also use a classical batch, i.e. wait
>>> until all trades are produced and then batch process them, but this would
>>> be too inefficient.
>>> >
>>> > If I use Flink, I will need a sort of watermark saying “done, no
>>> more trades”, and once this watermark reaches the end of the DAG, the final
>>> measure can be saved. More generally, it would be cool to have an indication
>>> at the end of the DAG telling which input stream position the current
>>> measure corresponds to.
>>> >
>>> > I feel my problem is very typical, yet I can’t find any solution. All
>>> examples operate either on infinite streams, where nobody cares about
>>> completion, or are classical batch examples which rely on all input data
>>> being ready.
>>> >
>>> > Can you please give me a hint?
>>> >
>>> > Thank you vm
>>> > Anton
