From: Maximilian Michels
Date: Tue, 14 Apr 2015 13:22:19 +0200
Subject: Re: Parallelism question
To: "user@flink.apache.org"

Hi Giacomo,

If you use a FileOutputFormat as a DataSink (e.g. by calling writeAsText("/path") on a DataSet), then the output directory will contain 5 files named 1, 2, 3, 4, and 5, each containing the output of the corresponding parallel task. The order of the data in each file follows the order of the corresponding partition of the distributed DataSet. You can locally sort a partition by a key using the sortPartition(..) method. This is only available in 0.9.0-milestone-1 and 0.9-SNAPSHOT.
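
As a rough illustration of the behaviour described above (plain Java, not Flink code), the following sketch splits a small list into 5 partitions, sorts each partition on its own, and writes one file per partition named 1 to 5. The class PartitionSketch, its method names, and the round-robin split are assumptions made for this example; Flink itself decides how data is distributed across the parallel instances.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration only; none of this is the Flink API.
class PartitionSketch {

    // Split the input into `parallelism` partitions (round-robin is an
    // arbitrary choice for this sketch, not necessarily what Flink does).
    static List<List<Integer>> partition(List<Integer> data, int parallelism) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < data.size(); i++) {
            parts.get(i % parallelism).add(data.get(i));
        }
        return parts;
    }

    // Local sort: each partition is sorted independently of the others.
    static void sortEachPartition(List<List<Integer>> parts) {
        for (List<Integer> part : parts) {
            Collections.sort(part);
        }
    }

    // Write partition i to a file named i + 1, one value per line.
    static void writePartitions(List<List<Integer>> parts, Path dir) {
        try {
            Files.createDirectories(dir);
            for (int i = 0; i < parts.size(); i++) {
                List<String> lines = new ArrayList<>();
                for (Integer value : parts.get(i)) {
                    lines.add(String.valueOf(value));
                }
                Files.write(dir.resolve(String.valueOf(i + 1)), lines);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        List<Integer> data = Arrays.asList(9, 3, 7, 1, 8, 2, 6, 4, 10, 5);
        List<List<Integer>> parts = partition(data, 5);
        sortEachPartition(parts);
        Path dir = Files.createTempDirectory("partition-sketch");
        writePartitions(parts, dir);
        System.out.println("Wrote " + parts.size() + " files to " + dir);
    }
}
```

Each file is sorted internally, but reading files 1 to 5 in sequence does not yield a globally sorted result. That is the difference between locally sorting each partition and establishing a total order over the whole data set.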

Best,
Max



On Tue, Apr 14, 2015 at 12:12 PM, Giacomo Licari <giacomo.licari@gmail.com> wrote:
Hi Max,
thank you for your reply.

Does the DataSink contain the data in order, i.e. output1, output2 ... output5? Or are they mixed?

Thanks a lot,
Giacomo

On Tue, Apr 14, 2015 at 11:58 AM, Maximilian Michels <mxm@apache.org> wrote:
Hi Giacomo,

If I understand you correctly, you want your Flink job to execute with a parallelism of 5. Just call setDegreeOfParallelism(5) on your ExecutionEnvironment. That way, all operations, when possible, will be performed using 5 parallel instances. This is also true for the DataSink, which will produce 5 files containing the output data from the parallel instances.

Best,
Max


On Tue, Apr 14, 2015 at 10:38 AM, Giacomo Licari <giacomo.licari@gmail.com> wrote:
Hi guys,
I have a question about how parallelism works.

If I have a large dataset and want to divide it into 5 blocks, can I pass each block of data to a fixed parallel process (for example, if I set up 5 processes)?

And if the result data from each process does not arrive at the output in order, can I order it? For example:

data from process 1
data from process 2
and so on

Thank you guys!


