flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com>
Subject Streaming : a way to "key by partition id" without redispatching data
Date Thu, 09 Nov 2017 17:00:13 GMT
Hello,

(Flink 1.2.1)

For performances reasons I'm trying to reduce the volume of data of my stream as soon as possible
by windowing/folding it for 15 minutes before continuing to the rest of the chain that contains
keyBys and windows that will transfer data everywhere.

Because of the huge volume of data, I want to avoid "moving" the data between partitions as
much as possible (not like a naïve KeyBy does). I wanted to create a custom ProcessFunction
(using timer and state to fold data for X minutes) in order to fold my data over itself before
keying the stream but even ProcessFunction needs a keyed stream...

Is there a specific "key" value that would ensure me that my data won't be moved to another
taskmanager (that it's hashcode will match the partition it is already in) ? I thought about
the subtask id but I doubt I'd be that lucky :-)

Suggestions

·         Wouldn't it be useful to be able to do a "partitionnedKeyBy" that would not move
data between nodes, for windowing operations that can be parallelized.

o   Something like kafka => partitionnedKeyBy(0) => first folding => keyBy(0) =>
second folding => ....

·         Finally, aren't all streams keyed ? Even if they're keyed by a totally arbitrary
partition id until the user chooses its own key, shouldn't we be able to do a window (not
windowAll) or process over any normal Stream's partition ?

B.R.

Gwenhaël PASQUIERS

Mime
View raw message