flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: Informing the runtime about data already repartitioned using "output contracts"
Date Mon, 18 May 2015 15:42:54 GMT
Hi Mustafa,

I'm afraid, this is not possible.
Although you can annotate DataSources with partitioning information, this
is not enough to avoid repartitioning for a CoGroup. The reason for that is
that CoGroup requires co-partitioning of both inputs, i.e., both inputs
must be equally partitioned (same number of partitions, same partitioning
function, same location of partitions). Since Flink is dynamically
assigning tasks to execution slots, it is not possible to co-locate data
that was read from a data source and data coming from the result of another
computation.

If you just need the result of the first co-group on disk, you could also
build a single program that does both co-groups and additional writes the
result of the first co-group to disk (Flink supports multiple data sinks).

Best, Fabian

2015-05-18 15:43 GMT+02:00 Mustafa Elbehery <elbeherymustafa@gmail.com>:

> Hi,
>
> I am writing a flink job, in which I have three datasets.  I have
> partitionedByHash the first two before coGrouping them.
>
> My plan is to spill the result of coGrouping to disk, and then re-read it
> again before coGrouping with the third dataset.
>
> My question is, is there anyway to inform flink that the first coGroup
> result is already partitioned ?!  I know I can re-partition again before
> coGrouping but I would like to know if there is anyway to avoid a step
> which was already executed,
>
> Regards.
>
> --
> Mustafa Elbehery
> EIT ICT Labs Master School <http://www.masterschool.eitictlabs.eu/home/>
> +49(0)15750363097
> skype: mustafaelbehery87
>
>

Mime
View raw message