flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vishal Santoshi <vishal.santo...@gmail.com>
Subject Re: Event time join
Date Thu, 08 Mar 2018 16:45:44 GMT
This is very interesting.  I would imagine that there will be high back
pressure on the LEFT source effectively throttling it but as is the current
state that is likely effect other pipelines as the free o/p buffer on the
source side and and i/p buffers on the consumer side start blocking and get
exhausted for all other pipes. I am very interested in how holding back the
busy source does not create a pathological  issue where that source is
forever held back. Is there a FLIP for it ?

On Thu, Mar 8, 2018 at 11:29 AM, Fabian Hueske <fhueske@gmail.com> wrote:

> Hi Gytis,
>
> Flink does currently not support holding back individual streams, for
> example it is not possible to align streams on (offset) event-time.
>
> However, the Flink community is working on a windowed join for the
> DataStream API, that only holds the relevant tail of the stream as state.
> If your join condition is +/- 5 minutes then, the join would store he last
> five minutes of both streams as state. Here's an implementation of the
> operator [1] that is close to be merged and will be available in Flink
> 1.6.0.
> Flink's SQL support (and Table API) support this join type since version
> 1.4.0 [2].
>
> Best, Fabian
>
> [1] https://github.com/apache/flink/pull/5342
> [2] https://ci.apache.org/projects/flink/flink-docs-
> release-1.4/dev/table/sql.html#joins
>
> 2018-03-08 1:02 GMT-08:00 Gytis Žilinskas <gytis.zilinskas@gmail.com>:
>
>> Hi,
>>
>> we're considering flink for a couple of our projects. I'm doing a
>> trial implementation for one of them. So far, I like a lot of things,
>> however there are a couple of issues that I can't figure out how to
>> resolve. Not sure if it's me misunderstanding the tool, or flink just
>> doesn't have a capability to do it.
>>
>> We want to do an event time join on two big kafka streams. Both of
>> them might experience some issues on the other end and be delayed.
>> Additionally, while both are big, one (let's call it stream A) is
>> significantly larger than stream B.
>>
>> We also know, that the join window is around 5min. That is, given some
>> key K in stream B, if there is a counterpart in stream A, it's going
>> to be +/5 5min in event time.
>>
>> Since stream A is especially heavy and it's unfeasable to keep hours
>> of it in memory, I would imagine an ideal solution where we read both
>> streams from Kafka. We always make sure that stream B is ahead by
>> 10min, that is, if stream A is currently ahead in watermarks, we stall
>> it and consume stream B until it catches up. Once the stream are
>> alligned in event time (with the 10min delay window) we run them both
>> through join.
>>
>> The problem is, that I find a mechanism to implement that in flink. If
>> I try to do a CoProcessFunction then it just consumes both streams at
>> the same time, ingests a lot of messages from stream A, runs out of
>> memory and dies.
>>
>> Any ideas on how this could be solved?
>>
>> (here's a thread with a very similar problem from some time ago
>> http://apache-flink-user-mailing-list-archive.2336050.n4.
>> nabble.com/synchronizing-two-streams-td6830.html)
>>
>> Regards,
>> Gytis
>>
>
>

Mime
View raw message