flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-1725) New Partitioner for better load balancing for skewed data
Date Wed, 02 Sep 2015 14:16:45 GMT

    [ https://issues.apache.org/jira/browse/FLINK-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727399#comment-14727399
] 

ASF GitHub Bot commented on FLINK-1725:
---------------------------------------

Github user gdfm commented on the pull request:

    https://github.com/apache/flink/pull/1069#issuecomment-137096426
  
    @tillrohrmann the reason why 2 is a "magic number" is the following.
    Immagine channels as containers, and the load (the incoming data stream) as a liquid.
    When a key is split in two channels, it creates a "bridge" between these channels.
    You can imagine it as a pipe between the two specific containers that enables sharing
the load, and brings the liquid to the same level (this happens thanks to the fact that the
new messages are sent to the least loaded of the two containers).
    Now, if you have enough of these pipes between pairs of containers, you will effectively
establish a network of load sharing among them. Any increase in pressure on one of the containers
can be shared across the network effectively, which brings the load balance.
    The trick is to have "enough" keys to create enough pipes.


> New Partitioner for better load balancing for skewed data
> ---------------------------------------------------------
>
>                 Key: FLINK-1725
>                 URL: https://issues.apache.org/jira/browse/FLINK-1725
>             Project: Flink
>          Issue Type: Improvement
>          Components: New Components
>    Affects Versions: 0.8.1
>            Reporter: Anis Nasir
>            Assignee: Anis Nasir
>              Labels: LoadBalancing, Partitioner
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Hi,
> We have recently studied the problem of load balancing in Storm [1].
> In particular, we focused on key distribution of the stream for skewed data.
> We developed a new stream partitioning scheme (which we call Partial Key Grouping). It
achieves better load balancing than key grouping while being more scalable than shuffle grouping
in terms of memory.
> In the paper we show a number of mining algorithms that are easy to implement with partial
key grouping, and whose performance can benefit from it. We think that it might also be useful
for a larger class of algorithms.
> Partial key grouping is very easy to implement: it requires just a few lines of code
in Java when implemented as a custom grouping in Storm [2].
> For all these reasons, we believe it will be a nice addition to the standard Partitioners
available in Flink. If the community thinks it's a good idea, we will be happy to offer support
in the porting.
> References:
> [1]. https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> [2]. https://github.com/gdfm/partial-key-grouping



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message