flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: Calling external services/databases from DataStream API
Date Mon, 30 Jan 2017 16:38:34 GMT
Hi!

A distributed cache would indeed be a nice addition to the DataStream API.
Since the runtime parts for it are already in place, the work would mainly
be on the "client" side that sets up the JobGraph to be submitted and
executed.

To scale this, I can see two solutions:

(1) Simpler: Use the new asynchronous I/O operator to talk to the external
database asynchronously, so that many requests can be in flight at once
(that should help to get higher throughput):
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/asyncio.html

(2) More elaborate: Convert the lookup database into a "changelog stream"
and make the enrichment operation a "stream-to-stream" join.
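
To illustrate the idea behind option (1), here is a minimal sketch in plain Java. It does not use the Flink async I/O operator itself (see the linked documentation for that); it only shows why issuing lookups without blocking helps throughput. The class name, the FAKE_DB map, the 50 ms delay and the pool size are all assumptions made up for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AsyncLookupSketch {
    // Hypothetical stand-in for the external database.
    private static final Map<String, String> FAKE_DB =
            Map.of("10.0.0.1", "DE", "10.0.0.2", "US");

    // Simulates a slow remote lookup that runs on the pool, not on the caller's thread.
    public static CompletableFuture<String> lookupAsync(String ip, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(50); // assumed network latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return FAKE_DB.getOrDefault(ip, "unknown");
        }, pool);
    }

    // Issues all lookups first, then collects the enriched records in input order.
    public static List<String> enrichAll(List<String> ips) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            // collect() forces every request to be submitted before any result is awaited
            List<CompletableFuture<String>> pending = ips.stream()
                    .map(ip -> lookupAsync(ip, pool).thenApply(country -> ip + "->" + country))
                    .collect(Collectors.toList());
            return pending.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(enrichAll(List.of("10.0.0.1", "10.0.0.2", "10.0.0.3")));
    }
}
```

With a blocking client, each record would wait for its own round trip before the next request is sent. Here the waiting times overlap, which is the effect the asynchronous operator aims for inside a streaming job.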

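Option (2) can be sketched in the same way. The plain-Java class below keeps the latest lookup value per key (fed by the changelog) and enriches events against it. It is only a single-threaded simulation, not Flink code. Parking events whose key has not arrived yet is an assumption on my side, chosen to match the case where the lookup data fills up over time. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangelogJoinSketch {
    // Latest lookup value per key, built up from the changelog.
    private final Map<String, String> table = new HashMap<>();
    // Events that arrived before their key was known (assumed policy: park them).
    private final Map<String, List<String>> waiting = new HashMap<>();
    // Enriched records in the order they were emitted.
    private final List<String> output = new ArrayList<>();

    // Called for every record on the changelog side.
    public void onUpdate(String key, String value) {
        table.put(key, value);
        List<String> parked = waiting.remove(key);
        if (parked != null) {
            for (String event : parked) {
                output.add(event + "@" + value);
            }
        }
    }

    // Called for every record on the event side.
    public void onEvent(String key, String event) {
        String value = table.get(key);
        if (value != null) {
            output.add(event + "@" + value);
        } else {
            waiting.computeIfAbsent(key, k -> new ArrayList<>()).add(event);
        }
    }

    public List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        ChangelogJoinSketch join = new ChangelogJoinSketch();
        join.onEvent("10.0.0.1", "click-1");  // key not known yet, parked
        join.onUpdate("10.0.0.1", "DE");      // releases click-1
        join.onEvent("10.0.0.1", "click-2");  // enriched immediately
        System.out.println(join.output());
    }
}
```

In a real job the table and the parked events would have to live in keyed state, so that they are partitioned with the streams and covered by checkpoints. The sketch leaves that out.
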
Greetings,
Stephan


On Mon, Jan 30, 2017 at 1:36 PM, Jonas <jonas@huntun.de> wrote:

> I have a similar use case where I (for the purposes of this discussion)
> have a GeoIP database that is not fully available from the start but will
> eventually be "full". The GeoIP tuples are coming in one after another.
> After ~4M tuples the GeoIP database is complete.
>
> I also need to do the same query.
