flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Dreyfus <dddrey...@gmail.com>
Subject Re: Tasks, slots, and partitioned joins
Date Thu, 26 Oct 2017 13:04:54 GMT
Hi Fabian,

Thank you for the great, detailed answers. 
1. So, each parallel slice of the DAG is placed into one slot. The key to
high utilization is many slices of the source data (or the various methods
of repartitioning it). Yes?
2. In batch processing, are slots filled round-robin on task managers, or do
I need to tune the number of slots to load the cluster evenly?
3. Are you suggesting that I perform the join in my custom data source?
4. Looking at this sample from
org.apache.flink.optimizer.PropertyDataSourceTest

  DataSource<Tuple2&lt;Long, String>> data = 
    env.readCsvFile("/some/path").types(Long.class, String.class); 
 
  data.getSplitDataProperties() 
    .splitsPartitionedBy(0); 

4.a Does this code assume that one split == one file from /some/path? If
readCsvFile splits each file, the guarantee that all keys in each part of
the file share the same partition would be violated, right?
4.b Is there a way to mark a partition number so that sources that share
partition numbers are read in parallel and joined? If I have 10,000 pairs, I
want partition 1 read from the sources at the same time.
4.c Does a downstream flatmap function get an open() call for each new
partition? Or, do I chain MapPartition directly to the datasource?

Thank you,
David



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Mime
View raw message