flink-user mailing list archives

From Tim Conrad <con...@math.fu-berlin.de>
Subject Re: Best way to process data in many files? (FLINK-BATCH)
Date Tue, 23 Feb 2016 14:44:35 GMT
Hi Till (and others).

Thank you very much for your helpful answer.

On 23.02.2016 14:20, Till Rohrmann wrote:
> [...] In contrast, if you had a parallel data source which would 
> consist of multiple source tasks, then these tasks would be independent 
> and spread out across your cluster [...]

Can you please send me a link to an example, or to the relevant part of 
the Flink API docs, that shows what a parallel data source is and how to 
create one with multiple source tasks?

A simple Google search did not give me an answer (maybe I used the wrong 
keywords, though...).


Cheers
Tim




On 23.02.2016 14:20, Till Rohrmann wrote:
>
> Hi Tim,
>
> depending on how you create the |DataSource<String> fileList|, Flink 
> will schedule the downstream operators differently. If you used the 
> |ExecutionEnvironment.fromCollection| method, then it will create a 
> |DataSource| with a |CollectionInputFormat|. This kind of |DataSource| 
> will only be executed with a degree of parallelism of 1. The source 
> will send its collection elements in a round-robin fashion to the 
> downstream operators which are executed with a higher parallelism. So 
> when Flink schedules the downstream operators, it will try to place 
> them close to their inputs. Since all flat map operators have the 
> single data source task as an input, they will be deployed on the same 
> machine if possible.
>
> In contrast, if you had a parallel data source which would consist of 
> multiple source tasks, then these tasks would be independent and spread 
> out across your cluster. In this case, every flat map task would have 
> a single distinct source task as input. When the flat map tasks are 
> deployed they would be deployed on the machine where their 
> corresponding source is running. Since the source tasks are spread out 
> across the cluster, the flat map tasks would be spread out as well.
>
> What you could do to mitigate your problem is to start the cluster 
> with as many slots as your maximum degree of parallelism. That way, 
> you’ll utilize all cluster resources.
>
> I hope this clarifies a bit why you observe that tasks tend to cluster 
> on a single machine.
>
> Cheers,
> Till
>
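
To make the two layouts in the quoted explanation concrete, here is a toy model in plain Java. It is not Flink code and does not use any Flink API; the class name, the method names and the "machine index" representation are all made up for illustration, and the placement rule is a simplification of "place each task close to its input". As far as I know, in Flink itself methods such as ExecutionEnvironment.readTextFile or ExecutionEnvironment.fromParallelCollection are the usual way to get a source that runs with parallelism greater than 1, but please check the API docs for your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, no Flink classes involved.
class SourcePlacementSketch {

    // A single (non-parallel) source deals its elements round-robin
    // to `parallelism` downstream tasks.
    static List<List<String>> roundRobin(List<String> elements, int parallelism) {
        List<List<String>> perTask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perTask.add(new ArrayList<>());
        }
        for (int i = 0; i < elements.size(); i++) {
            perTask.get(i % parallelism).add(elements.get(i));
        }
        return perTask;
    }

    // Assumption: independent source tasks are spread evenly over the machines.
    static int[] spreadSources(int parallelism, int machines) {
        int[] sourceMachines = new int[parallelism];
        for (int i = 0; i < parallelism; i++) {
            sourceMachines[i] = i % machines;
        }
        return sourceMachines;
    }

    // Each downstream task prefers the machine of its input:
    // with one source, every task reads from that source;
    // with N sources, task i reads from source i.
    static int[] preferredMachines(int[] sourceMachines, int parallelism) {
        int[] preferred = new int[parallelism];
        for (int task = 0; task < parallelism; task++) {
            preferred[task] = sourceMachines.length == 1
                    ? sourceMachines[0]
                    : sourceMachines[task];
        }
        return preferred;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("f1", "f2", "f3", "f4", "f5");
        System.out.println("round-robin split: " + roundRobin(files, 2));

        // Single source on machine 0, four flat map tasks.
        System.out.println("single source:    "
                + Arrays.toString(preferredMachines(new int[] {0}, 4)));

        // Four source tasks spread over two machines, four flat map tasks.
        System.out.println("parallel sources: "
                + Arrays.toString(preferredMachines(spreadSources(4, 2), 4)));
    }
}
```

In this model, the single-source case makes every downstream task prefer the one machine that hosts the source, while the parallel-source case gives each downstream task the machine of its own source, which is the clustering versus spreading behaviour described above.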

