flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: dataset sort
Date Fri, 09 Feb 2018 10:10:12 GMT
You can also partition by range and sort and write each partition. Once all
partitions have been written to files, you can concatenate the files.
As Till said it is not possible to sort in parallel and write in order to a
single file.

Best, Fabian

2018-02-09 10:35 GMT+01:00 Till Rohrmann <trohrmann@apache.org>:

> Hi David,
>
> Flink only supports sorting within partitions. Thus, if you want to write
> out a globally sorted dataset you should set the parallelism to 1 which
> effectively results in a single partition. Decreasing the parallelism of
> an operator will cause the individual partitions to lose its sort order
> because the individual partitions are read in a non deterministic order.
>
> Cheers,
> Till
>
>
> On Thu, Feb 8, 2018 at 8:07 PM, david westwood <david.d.westwood@gmail.com
> > wrote:
>
>> Hi:
>>
>> I would like to sort historical data using the dataset api.
>>
>> env.setParallelism(10)
>>
>> val dataset = [(Long, String)] ..
>> .paritionByRange(_._1)
>> .sortPartition(_._1, Order.ASCEDING)
>> .writeAsCsv("mydata.csv").setParallelism(1)
>>
>> the data is out of order (in local order)
>> but
>> .print()
>> prints the data in to correct order. I have run a small toy sample
>> multiple times.
>>
>> Is there a way to sort the entire dataset with parallelism > 1 and write
>> it to a single file in ascending order?
>>
>
>

Mime
View raw message