flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com>
Subject RE: Streaming : a way to "key by partition id" without redispatching data
Date Fri, 10 Nov 2017 15:02:20 GMT

Finally, even after creating my operator, I still get the error : "Timers can only be used
on keyed operators".

Isn't there any way around this ? A way to "key" my stream without shuffling the data ?

From: Gwenhael Pasquiers
Sent: vendredi 10 novembre 2017 11:42
To: Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com>; 'user@flink.apache.org' <user@flink.apache.org>
Subject: RE: Streaming : a way to "key by partition id" without redispatching data

Maybe you don't need to bother with that question.

I'm currently discovering AbstractStreamOperator, OneInputStreamOperator and Triggerable.

That should do it :-)

From: Gwenhael Pasquiers [mailto:gwenhael.pasquiers@ericsson.com]
Sent: jeudi 9 novembre 2017 18:00
To: 'user@flink.apache.org' <user@flink.apache.org<mailto:user@flink.apache.org>>
Subject: Streaming : a way to "key by partition id" without redispatching data


(Flink 1.2.1)

For performances reasons I'm trying to reduce the volume of data of my stream as soon as possible
by windowing/folding it for 15 minutes before continuing to the rest of the chain that contains
keyBys and windows that will transfer data everywhere.

Because of the huge volume of data, I want to avoid "moving" the data between partitions as
much as possible (not like a naïve KeyBy does). I wanted to create a custom ProcessFunction
(using timer and state to fold data for X minutes) in order to fold my data over itself before
keying the stream but even ProcessFunction needs a keyed stream...

Is there a specific "key" value that would ensure me that my data won't be moved to another
taskmanager (that it's hashcode will match the partition it is already in) ? I thought about
the subtask id but I doubt I'd be that lucky :-)


·         Wouldn't it be useful to be able to do a "partitionnedKeyBy" that would not move
data between nodes, for windowing operations that can be parallelized.

o   Something like kafka => partitionnedKeyBy(0) => first folding => keyBy(0) =>
second folding => ....

·         Finally, aren't all streams keyed ? Even if they're keyed by a totally arbitrary
partition id until the user chooses its own key, shouldn't we be able to do a window (not
windowAll) or process over any normal Stream's partition ?



View raw message