flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: Strange behavior of DataStream.countWindow
Date Fri, 23 Jun 2017 16:29:25 GMT
No, you will lose data locality if you use keyBy(), which is the only way
to obtain a KeyedStream.

2017-06-23 17:52 GMT+02:00 Edward <egbuck@hotmail.com>:

> Thanks, Fabian.
> In this case, I could just extend your idea by creating some deterministic
> multiplier of the subtask index:
>
>       RichMapFunction<String, Tuple2<Integer,String>> keyByMap = new
> RichMapFunction<String, Tuple2<Integer,String>>() {
>               public Tuple2<Integer,String> map(String value) {
>                 int indexOfCounter = Math.abs(value.hashCode()) % 4;
>                 int key = (( getRuntimeContext().getIndexOfThisSubtask() +
> 1)  * (indexOfCounter + 1)) - 1;
>                 counters.get(key).add(1);
>                 return new Tuple2<>(key, value);
>             }
>         };
>
> With this idea, if there are 12 subtasks, then subtask 0 would create 4
> keys: 0, 12, 24, and 36.
>
> The big advantage of your idea was that it would keep the data local. Is
> this still true with my example here (where I'm applying a function to the
> subtask index)? That is, if a each partition is generating a unique set of
> keys (unique to that subtask), will it optimize to keep that set of keys
> local for the next downstream subtask?
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/Strange-behavior-of-DataStream-
> countWindow-tp7482p13978.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>

Mime
View raw message