beam-commits mailing list archives

From "Kenneth Knowles (JIRA)" <>
Subject [jira] [Commented] (BEAM-1641) Support synchronized processing time in Flink runner
Date Thu, 23 Mar 2017 04:16:41 GMT


Kenneth Knowles commented on BEAM-1641:

Since this means that processing time triggers cannot work in pipelines with more than one
GBK in sequence, I think we should devise a mitigation.

The spec for the continuation trigger is rather underspecified, but the intent is "fire as
fast as reasonable so data from upstream 'just flows' but not so fast that you miss data".
I doubt the trigger language is actually closed under this operation with any precision.

There are some esoteric examples where this is a correctness issue: If you have _only_ a processing
time trigger then it will finish and discard all data. You definitely need the second GBK
to wait until it gets the data or it will drop it.
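
A toy sketch of that failure mode, in Python. It is not Beam or Flink API: the class and its methods are hypothetical, and it assumes that a trigger which fires once is then "finished" and that anything arriving afterwards is dropped.

```python
class OneShotTriggerPane:
    """Toy model (hypothetical, not Beam API) of a pane at the second GBK."""

    def __init__(self):
        self.buffer = []       # elements waiting for the trigger to fire
        self.finished = False  # becomes True once the trigger has fired
        self.dropped = []      # elements that arrived after the trigger finished

    def add(self, element):
        # Assumption: once the trigger has finished, later input is discarded.
        if self.finished:
            self.dropped.append(element)
        else:
            self.buffer.append(element)

    def fire(self):
        # Emit what is buffered and mark the trigger as finished.
        out, self.buffer = self.buffer, []
        self.finished = True
        return out


pane = OneShotTriggerPane()
pane.add("a")
emitted = pane.fire()  # fires before the upstream GBK has emitted everything
pane.add("b")          # upstream output that arrives too late
```

Here `emitted` holds only `"a"`, and `"b"` ends up in `pane.dropped`, which is why the second GBK has to wait for the upstream data.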

I actually favor making trigger "finishing" less strict, and having the GC firing still emit
all remaining data. Especially now that state & timers can address advanced cases like
"drop all data after 100 elements because that is probably enough" (which is almost certainly
not data-sensitive enough anyhow). In this case, it might be OK for the continuation trigger
to be a processing time trigger: it might miss something, but the GC firing will pick it up.
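
A minimal sketch of the less strict "finishing" proposed here, in Python. The names are hypothetical and this is not how Beam implements it today: the assumption is that input is still buffered after the trigger finishes, and that a final GC firing emits whatever remains.

```python
class LenientPane:
    """Toy model (hypothetical, not Beam API) of less strict trigger finishing."""

    def __init__(self):
        self.buffer = []
        self.trigger_finished = False

    def add(self, element):
        # Assumption: elements are buffered even after the trigger has finished.
        self.buffer.append(element)

    def fire_trigger(self):
        # Normal trigger firing: emit the buffer and mark the trigger finished.
        out, self.buffer = self.buffer, []
        self.trigger_finished = True
        return out

    def fire_gc(self):
        # GC firing at window expiry: emit everything still buffered.
        out, self.buffer = self.buffer, []
        return out


pane = LenientPane()
pane.add("a")
early = pane.fire_trigger()  # processing time continuation trigger fires early
pane.add("b")                # element the early firing missed
final = pane.fire_gc()       # the GC firing picks it up
```

In this model `early` is `["a"]` and `final` is `["b"]`, so nothing is lost even though the continuation trigger fired too soon.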

All that said - perhaps in Flink you can ignore continuation triggers and use some special
punctuation(s) to trigger?

> Support synchronized processing time in Flink runner
> ----------------------------------------------------
>                 Key: BEAM-1641
>                 URL:
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>            Reporter: Kenneth Knowles
>            Assignee: Aljoscha Krettek
> The "continuation trigger" for a processing time trigger is a synchronized processing
> time trigger. Today, this throws an exception in the FlinkRunner.
> The supports the following:
>  - GBK1
>  - GBK2
> When GBK1 fires due to processing time past the first element in the pane and that element
> arrives at GBK2, it will wait until all the other upstream keys have also processed and emitted
> corresponding data.
> Sorry for the terseness of explanation - writing quickly so I don't forget.
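
One way to read the waiting behavior in the issue description, sketched in Python. This is an illustrative model, not the Beam or Flink implementation: it assumes each upstream key reports how far its processing-time work has progressed, and that the synchronized processing time seen by GBK2 is the minimum of those values. The function names and the integer timestamps are hypothetical.

```python
def synchronized_processing_time(upstream_clocks):
    # Assumption: the synchronized time is the slowest upstream key's progress.
    return min(upstream_clocks.values())


def gbk2_may_fire(target_time, upstream_clocks):
    # GBK2 fires only once every upstream key has progressed past target_time.
    return synchronized_processing_time(upstream_clocks) >= target_time


clocks = {"key1": 105, "key2": 98}
blocked = gbk2_may_fire(100, clocks)  # key2 has not reached time 100 yet

clocks["key2"] = 101                  # key2 processes and emits its data
ready = gbk2_may_fire(100, clocks)
```

With these values `blocked` is `False` and `ready` is `True`: GBK2 holds its firing until the slowest upstream key has caught up.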
