flink-user mailing list archives

From Diego Fustes Villadóniga <dfus...@oesia.com>
Subject RE: Calling external services/databases from DataStream API
Date Tue, 31 Jan 2017 07:12:34 GMT
Hi Stephan,

Thanks a lot for your response. I’ll study the options that you mention. I’m not sure
the “changelog stream” will be easy to implement, since the lookup is based on matching
IP ranges and not just keys.
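
To make the range-versus-key distinction concrete, here is a minimal sketch in plain Java (not Flink API) of a lookup that matches an address against IP ranges. The class name `IpRangeIndex`, its methods, and the assumption that ranges are IPv4 and never overlap are all hypothetical, chosen only for illustration. A sorted map keyed by range start lets `floorEntry` find the only candidate range, which is the step a plain hash-keyed lookup cannot do.

```java
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical helper: maps non-overlapping IPv4 ranges to a label. */
class IpRangeIndex {
    /** Inclusive end of a range plus the value attached to it. */
    private static final class Range {
        final long last;
        final String label;
        Range(long last, String label) { this.last = last; this.label = label; }
    }

    // Range start -> range; assumes ranges never overlap.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    /** Converts dotted-quad IPv4 text to an unsigned 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Adds one range; entries can arrive one at a time, in any order. */
    void put(String first, String last, String label) {
        ranges.put(toLong(first), new Range(toLong(last), label));
    }

    /** Finds the range with the greatest start <= ip, then checks its end. */
    Optional<String> lookup(String ip) {
        long key = toLong(ip);
        java.util.Map.Entry<Long, Range> candidate = ranges.floorEntry(key);
        if (candidate == null || key > candidate.getValue().last) {
            return Optional.empty();
        }
        return Optional.of(candidate.getValue().label);
    }
}
```

Because `put` can be called incrementally, the same structure also works when the ranges are not all known up front; lookups for addresses whose range has not arrived yet simply return empty.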



From: Stephan Ewen [mailto:sewen@apache.org]
Sent: Monday, January 30, 2017 17:39
To: user@flink.apache.org
Subject: Re: Calling external services/databases from DataStream API


The distributed cache would indeed be nice to add to the DataStream API. Since the
runtime parts for that are all in place, the code would be mainly on the "client" side that
sets up the JobGraph to be submitted and executed.

For the problem of scaling this, there are two solutions that I can see:

(1) Simpler: Use the new asynchronous I/O operator to talk with the external database in an
asynchronous fashion, which should help to get higher throughput: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html

(2) More elaborate: Convert the lookup database into a "changelog stream" and make the enrichment
operation a "stream-to-stream" join.
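
The idea behind option (1) can be sketched in plain Java. This is not Flink's async I/O operator or its API, only an illustration of why overlapping lookups help: all requests are issued before any result is awaited, so the waiting time of the individual calls overlaps. The class `AsyncEnricher`, the method `enrichAll`, and the `blockingLookup` function standing in for a database client are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical helper illustrating concurrent lookups; not part of Flink. */
class AsyncEnricher {
    /**
     * Starts one lookup per key on the given pool, then collects the
     * results in the same order as the input keys.
     */
    static <K, V> List<V> enrichAll(List<K> keys,
                                    Function<K, V> blockingLookup,
                                    ExecutorService pool) {
        // Phase 1: issue every request without waiting for any of them.
        List<CompletableFuture<V>> pending = keys.stream()
                .map(k -> CompletableFuture.supplyAsync(() -> blockingLookup.apply(k), pool))
                .collect(Collectors.toList());

        // Phase 2: wait for the results, preserving input order.
        return pending.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

With a pool of n threads, up to n lookups are in flight at once, so total latency approaches that of the slowest batch rather than the sum of all calls. A client with a truly non-blocking API would avoid tying up pool threads altogether.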


On Mon, Jan 30, 2017 at 1:36 PM, Jonas <jonas@huntun.de> wrote:
I have a similar use case where I (for the purposes of this discussion) have a
GeoIP database that is not fully available from the start but will
eventually be "full". The GeoIP tuples come in one after another.
After ~4M tuples the GeoIP database is complete.

I also need to do the same query.

View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Calling-external-services-databases-from-DataStream-API-tp11366p11367.html