I've been struggling with an implementation problem in the last days, which I am almost sure caused by my misunderstanding of Flink.
The purpose: consume multiple streams, update a score list (with meta data e.g. user_id) for each update coming from any of the streams. The new output list will also need to be used by another pattern.
- We created 3 SourceFunctions, that periodically go to our MySQL database and stream new results back. This one returns POJOs.
- Then we flatMap these streams to unify their Type. They are now all Tuple3s with matching types.
- And we process each stream with the same ProcessFunction.
- I am stuck with the output list.
Business case (human translation workflow):
- Input: Stream "translation quality" score updates of each translator [translator_id, score]
- Input: Stream "responsivity score" updates of each translator (email open rates/speeds etc) [translator_id, score]
- Input: Stream "number of projects" updates each translator worked on [translator_id, score]
- Calculation: for each translator, use 3 scores to come up with a unified score and its percentile over all translators. This step definitely feels like a Batch job, but I am pushing to go with a streaming mindset.
- So now supposedly, in this way or another, I have a list of translators with their unified score and percentile over this list.
- Another independent stream should send me updates on "need for proofreaders" – I couldn't even come to this point yet. Once a need info is streamed, application would fetch the previously calculated list and let's say picks the top X determined by the message from need algorithm.
Overall, my desire is to make everything a stream and let the data and decisions constantly react to stream updates. I am very confused at this point. Tried using keyed and operator states, but they seem to be keeping their state only for their own items. Considering to do Batch instead after all the struggle.
Any ideas? I can even get on a call.
The World's Fastest Human Translation Platform.