beam-commits mailing list archives

From "Amit Sela (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-696) Side-Inputs non-deterministic with merging main-input windows
Date Tue, 11 Oct 2016 20:41:20 GMT

    [ https://issues.apache.org/jira/browse/BEAM-696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566498#comment-15566498 ]

Amit Sela edited comment on BEAM-696 at 10/11/16 8:41 PM:
----------------------------------------------------------

Does Dataflow "buffer until trigger..." if there are no sideInputs assigned?

Combiners are a very important optimization (for Spark certainly, and I guess for other runners
too), and Sessions (or any other merging windows) can be used without sideInputs, so I guess a
runner should defer *only* for merging windows and *only* if they are used with sideInputs.

I think my question is: where do we draw the line?

I could argue that in order to use sideInputs with merging windows, a pipeline author should
use an explicit {{GroupByKey}} followed by {{Combine.GroupedValues}}, or risk a non-deterministic
result.
There are analytical cases where you actually want to do that, such as identifying a sequence
of events in a time frame. It's clear you can't use combiners there, and you are willing to pay
the price of shuffling and grouping the events (plus maintaining non-compactable state).
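
To make the combiner vs. grouped-values distinction concrete, here is a minimal sketch in plain
Python (not the Beam API). The names {{combine_counts}} and {{has_sequence}}, the event names and
the timestamps are all hypothetical, chosen only for illustration. A count can be merged from
partial results, while sequence detection needs every event of the key and window at once.

```python
def combine_counts(partial_counts):
    """Compactable: partial counts from different bundles can simply be added."""
    return sum(partial_counts)


def has_sequence(events, pattern, time_frame):
    """Non-compactable: needs the full, grouped list of (timestamp, name) events.

    Returns True if the names in `pattern` (assumed non-empty) occur in order,
    with the last match at most `time_frame` after the first one.
    """
    ordered = sorted(events)  # order by timestamp
    for i, (start, name) in enumerate(ordered):
        if name != pattern[0]:
            continue
        matched = 1
        for ts, other in ordered[i + 1:]:
            if ts - start > time_frame:
                break  # outside the time frame for this starting event
            if matched < len(pattern) and other == pattern[matched]:
                matched += 1
        if matched == len(pattern):
            return True
    return False


# Hypothetical events for one key, arriving out of order.
events = [(5, "checkout"), (1, "login"), (3, "add_to_cart")]
pattern = ("login", "add_to_cart", "checkout")

print(combine_counts([2, 1]))            # two bundles, merged without the raw events
print(has_sequence(events, pattern, 10))  # needs all three events together
```

The count only ever keeps one integer per bundle, whereas {{has_sequence}} has to hold the whole
buffer until the window is complete, which is the shuffle-and-group cost mentioned above.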

I don't know if you have, or can access, such statistics, but I wonder what percentage of
pipelines with sessions also use sideInputs?


> Side-Inputs non-deterministic with merging main-input windows
> -------------------------------------------------------------
>
>                 Key: BEAM-696
>                 URL: https://issues.apache.org/jira/browse/BEAM-696
>             Project: Beam
>          Issue Type: Bug
>          Components: beam-model
>            Reporter: Ben Chambers
>            Assignee: Pei He
>
> Side-Inputs are non-deterministic for several reasons:
> 1. Because they depend on triggering of the side-input (this is acceptable because triggers
> are by their nature non-deterministic).
> 2. They depend on the current state of the main-input window in order to look up the side-input.
> This means that with merging
> 3. Any runner optimizations that affect when the side-input is looked up may cause problems
> with either or both of these.
> This issue focuses on #2 -- the non-determinism of side-inputs that execute within a
> Merging WindowFn.
> A possible solution would be to defer running anything that looks up the side-input until
> we need to extract an output, and to use the main-window at that point. Specifically, if the
> main-window is a MergingWindowFn, don't execute any kind of pre-combine; instead buffer all
> the inputs and combine later.
> This could still run into some non-determinism if there are triggers controlling when
> we extract output.
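
The difference between pre-combining and the buffering proposed in the description can be
sketched with a small plain-Python simulation (not Beam code). Everything here is an assumption
made for illustration: the gap, the side-input values, the function names, and the simplified
rule that the side input is chosen by the main window's maximum timestamp. Triggers are ignored.

```python
from itertools import permutations

GAP = 10        # session gap duration (assumed)
SIDE_SIZE = 10  # size of the side input's fixed windows (assumed)
SIDE = {0: 1, 1: 10, 2: 100}  # made-up side-input value per fixed window


def side_lookup(window):
    """Side-input value for the fixed window holding the main window's max timestamp."""
    _, end = window
    return SIDE[(end - 1) // SIDE_SIZE]


def merge_in(sessions, window, acc):
    """Merge (window, acc) into disjoint sessions; the merged session ends up last."""
    start, end = window
    kept = []
    for (s, e), a in sessions:
        if s < end and start < e:  # windows overlap, so merge them
            start, end, acc = min(s, start), max(e, end), acc + a
        else:
            kept.append(((s, e), a))
    kept.append(((start, end), acc))
    return kept


def eager(arrivals):
    """Pre-combine: look up the side input per element, using the window as merged so far."""
    sessions = []
    for t in arrivals:
        sessions = merge_in(sessions, (t, t + GAP), 0)
        window, acc = sessions[-1]
        sessions[-1] = (window, acc + side_lookup(window))
    return sorted(sessions)


def deferred(arrivals):
    """Buffer all inputs; look up the side input once per final merged window."""
    sessions = []
    for t in arrivals:
        sessions = merge_in(sessions, (t, t + GAP), [t])
    return sorted((w, sum(side_lookup(w) for _ in buf)) for w, buf in sessions)


EVENTS = (1, 8, 15)  # three timestamps that end up in one session, [1, 25)
eager_totals = {eager(p)[0][1] for p in permutations(EVENTS)}
deferred_totals = {deferred(p)[0][1] for p in permutations(EVENTS)}
print("eager:", eager_totals)
print("deferred:", deferred_totals)
```

Under these assumptions the eager totals vary with arrival order (120, 210 or 300 for this data),
while the deferred total is 300 for every order, because each element is combined against the
side input of the final merged window.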



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
