flink-user mailing list archives

From Nico Kruber <n...@data-artisans.com>
Subject Re: Streaming : a way to "key by partition id" without redispatching data
Date Mon, 13 Nov 2017 13:25:55 GMT
Hi Gwenhaël,
Several functions in Flink require keyed streams because they manage their
internal state by key. These keys, however, should be independent of the
current execution and its parallelism so that checkpoints can be restored at
a different parallelism (for re-scaling, see [1]).
Also, different operators, e.g. the source vs. the map, may have a different
number of parallel tasks, in which case you'd need to shuffle the data in order
to adapt. The same goes for possible differences between the number of
Kafka partitions and the parallelism you use in Flink.

If, however, all your operators have the same parallelism, doing multiple
keyBy(0) calls in your program will not re-shuffle the data: keys are assigned
to operator subtasks deterministically, so a record with a given key lands on
the same subtask index each time.
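
To illustrate the deterministic assignment, here is a minimal, self-contained
sketch of the key-group scheme described in [1]: a key is hashed into a key
group (which depends only on the key and the maximum parallelism), and key
groups are split into contiguous ranges, one per subtask. The class and method
names are hypothetical, and mix() is a stand-in hash for illustration, not
necessarily the function Flink uses internally.

```java
import java.util.Arrays;

final class KeyAssignmentSketch {

    // Stand-in bit mixer (an assumption for illustration; not Flink's exact hash).
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Key -> key group: depends only on the key and maxParallelism,
    // never on the subtask that currently holds the record.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(mix(key.hashCode()), maxParallelism);
    }

    // Key group -> subtask index: each subtask owns a contiguous range of key groups.
    static int subtaskFor(Object key, int maxParallelism, int parallelism) {
        return keyGroupFor(key, maxParallelism) * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // assumed value for the example
        int parallelism = 4;      // same parallelism for both keyBy stages
        for (String key : Arrays.asList("sensor-1", "sensor-2", "sensor-3")) {
            int first = subtaskFor(key, maxParallelism, parallelism);
            int second = subtaskFor(key, maxParallelism, parallelism);
            System.out.println(key + " -> subtask " + first
                    + " (second keyBy: subtask " + second + ")");
        }
    }
}
```

With equal parallelism, both calls return the same index for a given key. The
index is derived from the key alone, though, so it is unrelated to the Kafka
partition or source subtask the record was originally read from.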


[1] https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html
On Thursday, 9 November 2017 18:00:13 CET Gwenhael Pasquiers wrote:
> Hello,
> (Flink 1.2.1)
> For performance reasons I'm trying to reduce the volume of data in my
> stream as early as possible by windowing/folding it for 15 minutes before
> continuing to the rest of the chain, which contains keyBys and windows that
> will transfer data everywhere.
> Because of the huge volume of data, I want to avoid "moving" the data
> between partitions as much as possible (unlike what a naïve keyBy does). I
> wanted to create a custom ProcessFunction (using timers and state to fold
> data for X minutes) in order to fold my data over itself before keying the
> stream, but even ProcessFunction needs a keyed stream...
> Is there a specific "key" value that would guarantee that my data won't be
> moved to another TaskManager (i.e. whose hash code will match the partition
> it is already in)? I thought about the subtask id but I doubt I'd be that
> lucky :-)
> Suggestions:
> - Wouldn't it be useful to be able to do a "partitionnedKeyBy" that would
>   not move data between nodes, for windowing operations that can be
>   parallelized?
>   - Something like: kafka => partitionnedKeyBy(0) => first folding =>
>     keyBy(0) => second folding => ....
> - Finally, aren't all streams keyed? Even if they're keyed by a totally
>   arbitrary partition id until the user chooses their own key, shouldn't we
>   be able to do a window (not windowAll) or process over any normal
>   stream's partition?
> B.R.
> Gwenhaël PASQUIERS
