flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-8532) RebalancePartitioner should use Random value for its first partition
Date Thu, 23 Aug 2018 17:20:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590561#comment-16590561 ]

ASF GitHub Bot commented on FLINK-8532:

Guibo-Pan commented on issue #6544: [FLINK-8532] [Streaming] modify RebalancePartitioner to
use a random partition as its first partition
URL: https://github.com/apache/flink/pull/6544#issuecomment-415499383
   Hi @StephanEwen, your suggestion made me think more deeply, and maximum performance is
exactly what we want, so I would like to ask for more suggestions. I would prefer to initialize
the partitioner instance with a random partition; however, in the current design the partitioner
does not know the target range (the number of output channels) when it is constructed.
   The alternative looks like this:
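   // Start from a random value so that parallel senders begin on different channels.
   private final int[] returnArray = new int[] {new Random().nextInt(Integer.MAX_VALUE - 1)};
   public int[] selectChannels(SerializationDelegate<StreamRecord<T>> record,
   		int numberOfOutputChannels) {
   	// Advance round-robin; the modulo maps the random start into the valid channel range.
   	this.returnArray[0] = (this.returnArray[0] + 1) % numberOfOutputChannels;
   	return this.returnArray;
   }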
    Please tell me what you think, thanks.
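
For illustration, the idea can be tried outside Flink. The following is a minimal standalone sketch, not the code from the pull request: the class name `RandomStartRoundRobin` is hypothetical, the record parameter is dropped, and a `Random` is passed in so that the start value can be seeded in a test. It only shows that the first modulo folds the random start into the channel range and that later calls advance round-robin.

```java
import java.util.Random;

/** Hypothetical standalone sketch (no Flink dependencies) of round-robin with a random start. */
final class RandomStartRoundRobin {
    // Single-element array reused on every call to avoid a per-record allocation.
    private final int[] returnArray;

    RandomStartRoundRobin(Random random) {
        // The bound keeps the value at most Integer.MAX_VALUE - 2, so the first "+ 1" cannot overflow.
        this.returnArray = new int[] {random.nextInt(Integer.MAX_VALUE - 1)};
    }

    int[] selectChannels(int numberOfOutputChannels) {
        // The channel count is only known here, so the random start is reduced on first use.
        returnArray[0] = (returnArray[0] + 1) % numberOfOutputChannels;
        return returnArray;
    }

    public static void main(String[] args) {
        RandomStartRoundRobin selector = new RandomStartRoundRobin(new Random());
        // Prints six channel indices in [0, 3): a random first index, then round-robin.
        for (int i = 0; i < 6; i++) {
            System.out.println(selector.selectChannels(3)[0]);
        }
    }
}
```

After the first call the stored value always lies in the channel range, so the extra cost per record is the same single modulo as with a fixed start.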

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
> RebalancePartitioner should use Random value for its first partition
> --------------------------------------------------------------------
>                 Key: FLINK-8532
>                 URL: https://issues.apache.org/jira/browse/FLINK-8532
>             Project: Flink
>          Issue Type: Improvement
>          Components: DataStream API
>            Reporter: Yuta Morisawa
>            Assignee: Guibo Pan
>            Priority: Major
>              Labels: pull-request-available
> In some conditions, RebalancePartitioner doesn't balance data correctly because it uses
> the same value for selecting the next operators.
> RebalancePartitioner initializes its partition id with the same value in every thread, so
> it does balance data overall, but at any given moment the amount of data in each operator
> is skewed.
> In particular, when the data rates of the upstream operators are equal, the skew becomes severe.
> Example:
> Consider a simple operator chain.
> -> map1 -> rebalance -> map2 ->
> Each map operator (map1, map2) contains three subtasks (subtasks 1-3 in map1, subtasks 4-6 in map2).
> map1          map2
>  st1              st4
>  st2              st5
>  st3              st6
> At the beginning, every subtask in map1 sends data to st4 in map2 because they all use the
> same initial partition id.
> The next time map1 receives data, st1, st2, and st3 all send data to st5 because each one
> increments its partition id after processing the previous record.
> In my environment, processing takes twice as long with RebalancePartitioner as it does
> with other partitioners (rescale, keyBy).
> To solve this problem, in my opinion, RebalancePartitioner should use its own operator
> id for the initial value.
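
The skew described in the issue can be reproduced with a small simulation. The following is a minimal sketch under simplifying assumptions, not Flink code: the class `RebalanceSkewDemo` and the method `maxLoadPerRound` are hypothetical names, every sender is assumed to emit exactly one record per round, and every initial channel is assumed to lie in `[0, numChannels)`. It compares identical initial channels with per-sender distinct ones.

```java
/** Hypothetical standalone simulation (no Flink dependencies) of per-round load under round-robin. */
final class RebalanceSkewDemo {

    /**
     * Returns the largest number of records that any single channel receives within one round,
     * assuming every sender emits exactly one record per round.
     */
    static int maxLoadPerRound(int[] initialChannels, int numChannels, int rounds) {
        int[] next = initialChannels.clone(); // next target channel of each sender
        int worst = 0;
        for (int r = 0; r < rounds; r++) {
            int[] load = new int[numChannels];
            for (int s = 0; s < next.length; s++) {
                load[next[s]]++;                       // sender s emits one record
                next[s] = (next[s] + 1) % numChannels; // then advances round-robin
            }
            for (int l : load) {
                worst = Math.max(worst, l);
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // All three senders start on channel 0: each round hits one channel with all records.
        System.out.println(maxLoadPerRound(new int[] {0, 0, 0}, 3, 6)); // prints 3
        // Each sender starts on its own index: each round spreads the records evenly.
        System.out.println(maxLoadPerRound(new int[] {0, 1, 2}, 3, 6)); // prints 1
    }
}
```

In both runs every channel receives the same total number of records; only the per-round distribution differs, which matches the observation that the data is balanced overall but skewed at any given moment.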

This message was sent by Atlassian JIRA
