flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: Strange behavior of DataStream.countWindow
Date Fri, 23 Jun 2017 15:45:07 GMT
Flink hashes the keys and computes the target partition using modulo. This
works well, if you have many keys but leads to skew if the number of keys
is close to the number of partitions.
If you use parittionCustom, you can explicitly define the target partition,
however, partitionCustom does not return a KeyedStream, so you cannot use
keyed state or windows there.

Not sure if that works for your usecase, but you could try to use more keys
to achieve a more uniform key distribution.

Best, Fabian

2017-06-23 15:34 GMT+02:00 Edward <egbuck@hotmail.com>:

> Hi Fabian -
> I've tried this idea of creating a KeyedStream based on
> getRuntimeContext().getIndexOfThisSubtask(). However, not all target
> subtasks are receiving records.
>
> All subtasks have a parallelism of 12, so I have 12 source subtasks and 12
> target subtasks. I've confirmed that the call to getIndexOfThisSubtask is
> evenly distributed between 0 and 11. However, 4 out of the 12 target
> subtasks (the subtasks after the hash) are no receiving any data. This
> means
> it's not actually keeping all the data local, because at least 4 of the 12
> partitions could be getting sent to different TaskManagers.
>
> Do I need to do a .partitionCustom to ensure even/local distribution?
>
> Thanks,
> Edward
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/Strange-behavior-of-DataStream-
> countWindow-tp7482p13971.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>

Mime
View raw message