flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kirschnick, Johannes" <johannes.kirschn...@tu-berlin.de>
Subject AW: Support for Total Order Partitioner ?
Date Wed, 10 Dec 2014 12:57:18 GMT
Hi Fabian,

thanks for the response.
At least I have not missed it yet.

Deriving a good partition key automatically is not easy and the pig solution is a nice workaround.
So at some point this might be interesting to integrate into the whole pipeline,
when there is a way to track statistics of the data as it flows through the pipeline.
Maybe making it possible to approximate the distribution is certain circumstances without
re-reading more data.

I'll revisit this issue when I do have a concrete need that requires ist, other than feature
matching with pig.

Johannes
________________________________________
Von: Fabian Hueske <fhueske@apache.org>
Gesendet: Freitag, 5. Dezember 2014 11:39
An: dev@flink.incubator.apache.org
Betreff: Re: Support for Total Order Partitioner ?

Hi Johannes,

right now the is no straight-forward way to write data in total order in
Flink.

The engine allows to locally sort data before it is given to a data sink.
However, this is not exposed to the API right now but should be fairly easy
to add (FLINK-1105). Given this feature, you can have a total order with a
data sink with DOP=1 (obviously, that doesn't work for large output....).
Flink recently added custom partitioners which can be used for manual range
partitioning if you know the distribution of the sort keys. Unfortunately,
custom partitioners are not documented, yet (FLINK-1260), see the PR for
information: https://github.com/apache/incubator-flink/pull/207.

For automatically fully sorted output, Flink would need to derive the
distribution itself just as you described for PIG. This would be a
non-trivial change.

tldr;
With FLINK-1105 fixed and a custom partitioner (given you know the
distribution), you can have totally ordered output in parallel.

2014-12-04 21:54 GMT+01:00 Kirschnick, Johannes <
johannes.kirschnick@tu-berlin.de>:

> Hi,
>
>
> I have a small question about a custom partitioners.
>
>
> I couldn't really find a method in the Java API which lets me partition
> the dataset in a total order.
>
> Is that something that I just overlooked, or is that something not really
> supported?
>
>
> The use case is writing out data so that it can be consumed by another
> tool. and also for quick human inspection
>
> I can easily apply the sorting on a hash partition, but I would loose the
> total order, that all keys in partition one come after or before partition
> two etc.
>
>
> PIG supports this by mapping the ORDER BY statement to 2 jobs. The first
> one samples the data to build a sketch to approximate the key distribution
> and then feds the result into the total order partitioner.
>
>
> Thanks
>
>
> Johannes
>
Mime
View raw message