flink-user mailing list archives

From: qi luo <luoqi...@gmail.com>
Subject: Set partition number of Flink DataSet
Date: Mon, 11 Mar 2019 12:42:06 GMT
Hi,

We’re trying to distribute batch input data across N HDFS files, partitioned by hash, using the
DataSet API. What I’m doing looks like this:

env.createInput(…)
      .partitionByHash(0)
      .setParallelism(N)
      .output(…)

This works well for a small number of files. But when we need to distribute to a large number
of files (say 100K), the parallelism becomes too large and we cannot afford that many TMs (TaskManagers).

In Spark we can write something like ‘rdd.partitionBy(N)’ and control the parallelism
separately (using dynamic allocation). Is there anything similar in Flink, or another way we
can achieve a similar result? Thank you!
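For reference, this is roughly the Spark pattern I mean (just a sketch from memory; the
input/output paths and the key extraction are made up for illustration):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-sketch"))

    // Key each record by the field we hash on (illustrative only).
    val pairs = sc.textFile("hdfs:///input/path")
      .map(line => (line.split("\t")(0), line))

    // 100K partitions -> 100K output files, while the number of executors
    // is sized independently (e.g. spark.dynamicAllocation.enabled=true).
    pairs
      .partitionBy(new HashPartitioner(100000))
      .saveAsTextFile("hdfs:///output/path")

    sc.stop()
  }
}

Here the number of output partitions is decoupled from the size of the cluster, which is
exactly what I’d like to do in Flink.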

Qi