flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Konstantin Knauf <kna...@apache.org>
Subject Re: Handling stale data enrichment
Date Fri, 24 Apr 2020 06:43:58 GMT
Hi Vinay,

I assume your subscription updates also have a timestamp and a watermark.
Otherwise, there is no way for Flink to tell that the subscription updates
are late.

If you use a "temporal table "-style join to join the two streams, and you
do not receive any subscription updates for 2 hours, the watermark will not
advance (it is the minimum of the two input streams) and hence all click
events will be buffered. No output. This has the advantage of not sending
out duplicate records, but the disadvantage that you do not make any
progress until you see fresh subscription data. Is this the desired
behavior for your use case?



On Thu, Apr 23, 2020 at 1:29 PM Vinay Patil <vinay18.patil@gmail.com> wrote:

> Hi,
> I went through Konstantin webinar on 99 ways you can do enrichment. One
> thing I am failing to understand is how do we efficiently handle stale data
> enrichment.
> Context: Let's say I want to enrich user data with the subscription data.
> Here subscription data is acting as reference data and will be used for
> joining these two streams based on event time. Consider the following
> scenario:
>    1. We are going to enrich Click Stream event (containing user_info)
>    with Subscription details
>    2. Subscription Status for Alice user is FREE
>    3. Current Internal State contains Alice with Subscription status as
>    FREE
>    4.
>    Reference data is not flowing because of some issue for 2hrs
>    5.
>    Alice upgraded the subscription to Premium at 10.30 AM
>    6.
>    Watched video event comes for Alice at 10.40 AM
>    -
>       flink pipeline looks up in internal state and writes to enrichment
>       topic
>       -
>       Enrichment topic now contains Alice -> FREE
>       7.
>    Reference data starts flowing in at 11AM
>    -
>       let's assume we consider late elements upto 2 hours, so the click
>       stream event of Alice is still buffered in the state
>       - The enrichment topic will now contain duplicate records for Alice
>       because of multiple firings of window
>    1. Alice -> FREE -> 10 AM
>       2. Alice -> PREMIUM -> 11 AM
> Question is how do I avoid sending duplicate records ? I am not able to
> understand it. I can think of Low Level joins but not sure how do we know
> if it is stale data or not based on timestamp (watermark) as it can happen
> that a particular enriched record is not updated for 6 hrs.
> Regards,
> Vinay Patil


Konstantin Knauf

View raw message