flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Ertl <peter.e...@gmx.net>
Subject load + update global state
Date Mon, 07 Aug 2017 19:00:49 GMT
Hi folks,

I am coding a streaming task that processes http requests from our web site and enriches these
with additional information.

It contains session ids from historic requests and the related emails that were used within
these session in the past.


    lookup - hashtable:     session_id: String => emails: Set[String]


During processing of these NEW http request

- the lookup table should be used to get previous emails and enrich the current stream item
- new candidates for the lookup table will be discovered during processing of these items
and should be added to the lookup table (also these changes should be visible through the
cluster)

I see at least the following issues:

(1) load the state as a whole from the data store into memory is a huge burn of memory (also
making changes cluster-wide visible is an issue)

(2) not loading into memory but using something like cassandra / redis as a lookup store would
certainly work but introduces a lot of network requests (possible ideas: use a distributed
cache? broadcast updates in flink cluster?)

(3) how should I integrate the changes to the table with flink's checkpointing?

I really don't get how to solve this best and my current solution is far from elegant....


So is there any best practice for supporting "large lookup tables that change during stream
processing" ?

Cheers
Peter





Mime
View raw message