flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Fabian Hueske (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-1060) Support explicit shuffling of DataSets
Date Wed, 20 Aug 2014 19:52:27 GMT
Fabian Hueske created FLINK-1060:
------------------------------------

             Summary: Support explicit shuffling of DataSets
                 Key: FLINK-1060
                 URL: https://issues.apache.org/jira/browse/FLINK-1060
             Project: Flink
          Issue Type: Improvement
          Components: Java API, Optimizer
            Reporter: Fabian Hueske
            Assignee: Fabian Hueske
            Priority: Minor


Right now, Flink only shuffles data if it is required by some operation such as Reduce, Join,
or CoGroup. There is no way to explicitly shuffle a data set.

However, in some situations explicit shuffling would be very helpful including:
- rebalancing before compute-intensive Map operations
- balancing, random or hash partitioning before PartitionMap operations (see FLINK-1053)
- better integration of support for HadoopJobs (see FLINK-838)

With this issue, I propose to add the following methods to {{DataSet}}
- {{DataSet.partitionHashBy(int...)}} and {{DataSet.partitionHashBy(KeySelector)}} to perform
an explicit hash partitioning
- {{DataSet.partitionRandomly()}} to shuffle data completely random
- {{DataSet.partitionRoundRobin()}} to shuffle data in a round-robin fashion that generates
very even distribution with possible bias due to prior distributions

The {{DataSet.partitionRoundRobin()}} might not be necessary if we think that random shuffling
balances good enough.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message