flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinay Patil <vinay18.pa...@gmail.com>
Subject Handling stale data enrichment
Date Thu, 23 Apr 2020 11:28:51 GMT

I went through Konstantin webinar on 99 ways you can do enrichment. One
thing I am failing to understand is how do we efficiently handle stale data

Context: Let's say I want to enrich user data with the subscription data.
Here subscription data is acting as reference data and will be used for
joining these two streams based on event time. Consider the following

   1. We are going to enrich Click Stream event (containing user_info) with
   Subscription details
   2. Subscription Status for Alice user is FREE
   3. Current Internal State contains Alice with Subscription status as FREE

   Reference data is not flowing because of some issue for 2hrs

   Alice upgraded the subscription to Premium at 10.30 AM

   Watched video event comes for Alice at 10.40 AM

      flink pipeline looks up in internal state and writes to enrichment

      Enrichment topic now contains Alice -> FREE

   Reference data starts flowing in at 11AM

      let's assume we consider late elements upto 2 hours, so the click
      stream event of Alice is still buffered in the state
      - The enrichment topic will now contain duplicate records for Alice
      because of multiple firings of window
   1. Alice -> FREE -> 10 AM
      2. Alice -> PREMIUM -> 11 AM

Question is how do I avoid sending duplicate records ? I am not able to
understand it. I can think of Low Level joins but not sure how do we know
if it is stale data or not based on timestamp (watermark) as it can happen
that a particular enriched record is not updated for 6 hrs.

Vinay Patil

View raw message