flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: Parallelism question
Date Thu, 16 Apr 2015 13:35:46 GMT
Hi Giacomo,

as Max said, you can sort the data within a partition.

However, data across partitions is not sorted. It is either randomly
partitioned or hash-partitioned (all records that share some property are
in the same partition). Producing fully ordered output, where the first
partition has all values 1-10, the second 11-20, and so on, requires range
partitioning, which is not yet supported by Flink.
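
To make the difference concrete, here is a minimal sketch in plain Java. It
does not use the Flink API at all: the class PartitionSketch and its methods
are hypothetical helpers, the hash function is simply the integer value
modulo the number of partitions, and the range boundaries (width 10, values
starting at 1) are assumed to be known up front. It sorts each partition
locally, as sortPartition(..) would, and shows that only the range-partitioned
variant yields globally ordered output when the partitions are read in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of this is Flink API.
public class PartitionSketch {

    static List<List<Integer>> emptyPartitions(int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        return parts;
    }

    // Hash partitioning: equal values always land in the same partition,
    // but neighbouring values are scattered across partitions.
    static List<List<Integer>> hashPartition(List<Integer> values, int numPartitions) {
        List<List<Integer>> parts = emptyPartitions(numPartitions);
        for (int v : values) {
            parts.get(Math.floorMod(Integer.hashCode(v), numPartitions)).add(v);
        }
        return parts;
    }

    // Range partitioning (assumed boundaries): partition 0 gets 1..width,
    // partition 1 gets width+1..2*width, and so on. Out-of-range values
    // are clamped into the first or last partition.
    static List<List<Integer>> rangePartition(List<Integer> values, int numPartitions, int width) {
        List<List<Integer>> parts = emptyPartitions(numPartitions);
        for (int v : values) {
            int idx = Math.max(0, Math.min((v - 1) / width, numPartitions - 1));
            parts.get(idx).add(v);
        }
        return parts;
    }

    // Local sort: each partition is sorted on its own, in place.
    static List<List<Integer>> sortEachPartition(List<List<Integer>> parts) {
        for (List<Integer> part : parts) {
            Collections.sort(part);
        }
        return parts;
    }

    // Reads the partitions one after another, like reading files 1, 2, 3, ...
    static List<Integer> concat(List<List<Integer>> parts) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> part : parts) {
            all.addAll(part);
        }
        return all;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(25, 3, 14, 30, 8, 19, 1, 22, 11, 6);
        System.out.println("hash:  " + sortEachPartition(hashPartition(data, 3)));
        System.out.println("range: " + sortEachPartition(rangePartition(data, 3, 10)));
    }
}
```

With hash partitioning every partition is sorted internally, yet
concatenating them does not give a sorted result. With range partitioning
plus the same local sort, the concatenation is fully ordered.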

2015-04-14 6:22 GMT-05:00 Maximilian Michels <mxm@apache.org>:

> Hi Giacomo,
>
> If you use a FileOutputFormat as a DataSink (e.g. as in
> dataSet.writeAsText("/path")), then the output directory will contain 5 files
> named 1, 2, 3, 4, and 5, each containing the output of the corresponding
> task. The order of the data in the files follows the order of the
> distributed DataSet. You can locally sort a partition by a key using the
> sortPartition(..) command. This is only available in 0.9.0-milestone-1 and
> 0.9-snapshot.
>
> Best,
> Max
>
>
>
> On Tue, Apr 14, 2015 at 12:12 PM, Giacomo Licari <giacomo.licari@gmail.com>
> wrote:
>
>> Hi Max,
>> thank you for your reply.
>>
>> Does the DataSink contain the data in order, I mean, does it contain
>> output1, output2 ... output5 in order? Or are they mixed?
>>
>> Thanks a lot,
>> Giacomo
>>
>> On Tue, Apr 14, 2015 at 11:58 AM, Maximilian Michels <mxm@apache.org>
>> wrote:
>>
>>> Hi Giacomo,
>>>
>>> If I understand you correctly, you want your Flink job to execute with a
>>> parallelism of 5. Just call setDegreeOfParallelism(5) on your
>>> ExecutionEnvironment. That way, all operations, when possible, will be
>>> performed using 5 parallel instances. This is also true for the DataSink
>>> which will produce 5 files containing the output data from the parallel
>>> instances.
>>>
>>> Best,
>>> Max
>>>
>>>
>>> On Tue, Apr 14, 2015 at 10:38 AM, Giacomo Licari <
>>> giacomo.licari@gmail.com> wrote:
>>>
>>>> Hi guys,
>>>> I have a question about how parallelism works.
>>>>
>>>> If I have a large dataset and I want to divide it into 5 blocks, can I
>>>> pass each block of data to a fixed parallel process (for example, if I
>>>> set up 5 processes)?
>>>>
>>>> And if the result data from each process does not arrive at the output
>>>> in order, can I order it? For example:
>>>>
>>>> data from process 1
>>>> data from process 2
>>>> and so on
>>>>
>>>> Thank you guys!
>>>>
>>>
>>>
>>
>
