flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gábor Gévay <gga...@gmail.com>
Subject Re: Best way to process data in many files? (FLINK-BATCH)
Date Wed, 24 Feb 2016 15:01:56 GMT
Hello,

> // For each "filename" in list do...
> DataSet<FeatureList> featureList = fileList
>                 .flatMap(new ReadDataSetFromFile()) // flatMap because there
> might multiple DataSets in a file

What happens if you just insert .rebalance() before the flatMap?

> This kind of DataSource will only be executed
> with a degree of parallelism of 1. The source will send it’s collection
> elements in a round robin fashion to the downstream operators which are
> executed with a higher parallelism. So when Flink schedules the downstream
> operators, it will try to place them close to their inputs. Since all flat
> map operators have the single data source task as an input, they will be
> deployed on the same machine if possible.

Sorry, I'm a little confused here. Do you mean that the flatMap will
have a high parallelism, but all instances on a single machine?
Because I tried to reproduce the situation where I have a non-parallel
data source and then a flatMap, and the plan shows that the flatMap
actually has parallelism 1, which would be an alternative explanation
to the original problem that it gets executed on a single machine.
Then, if I insert .rebalance() after the source, then a "Partition"
operation appears between the source and the flatMap, and the flatMap
has a high parallelism. I think this should also solve the problem,
without having to write a parallel data source.

Best,
Gábor

Mime
View raw message