flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Till Rohrmann <trohrm...@apache.org>
Subject Re: Best way to process data in many files? (FLINK-BATCH)
Date Wed, 24 Feb 2016 11:24:06 GMT
Hi Tim,

unfortunately, this is not documented explicitly as far as I know. For the
InputFormats there is a marker interface called NonParallelInput. The input
formats which implement this interface will be executed with a parallelism
of 1. At the moment this holds true for the CollectionInputFormat,
IteratorInputFormat and the JDBCInputFormat.

I hope this helps.

Cheers,
Till
​

On Tue, Feb 23, 2016 at 3:44 PM, Tim Conrad <conrad@math.fu-berlin.de>
wrote:

> Hi Till (and others).
>
> Thank you very much for your helpful answer.
>
> On 23.02.2016 14:20, Till Rohrmann wrote:
>
> [...] In contrast, if you had a parallel data source which would consist
> of multiple source task, then these tasks would be independent and spread
> out across your cluster [...]
>
>
> Can you please send me a link to an example or to the respective Flink API
> doc, where I can see which is a parallel data source and how to create it
> with multiple source tasks?
>
> A simple Google search did not provide me with an answer (maybe I used the
> wrong key words, though...).
>
>
> Cheers
> Tim
>
>
>
>
> On 23.02.2016 14:20, Till Rohrmann wrote:
>
> Hi Tim,
>
> depending on how you create the DataSource<String> fileList, Flink will
> schedule the downstream operators differently. If you used the
> ExecutionEnvironment.fromCollection method, then it will create a
> DataSource with a CollectionInputFormat. This kind of DataSource will
> only be executed with a degree of parallelism of 1. The source will send
> it’s collection elements in a round robin fashion to the downstream
> operators which are executed with a higher parallelism. So when Flink
> schedules the downstream operators, it will try to place them close to
> their inputs. Since all flat map operators have the single data source task
> as an input, they will be deployed on the same machine if possible.
>
> In contrast, if you had a parallel data source which would consist of
> multiple source task, then these tasks would be independent and spread out
> across your cluster. In this case, every flat map task would have a single
> distinct source task as input. When the flat map tasks are deployed they
> would be deployed on the machine where their corresponding source is
> running. Since the source tasks are spread out across the cluster, the flat
> map tasks would be spread out as well.
>
> What you could do to mitigate your problem is to start the cluster with as
> many slots as your maximum degree of parallelism is. That way, you’ll
> utilize all cluster resources.
>
> I hope this clarifies a bit why you observe that tasks tend to cluster on
> a single machine.
>
> Cheers,
> Till
> ​
>
>
>

Mime
View raw message