Hi Tim,

Unfortunately, this is not documented explicitly as far as I know. For the InputFormats there is a marker interface called NonParallelInput. Input formats that implement this interface are executed with a parallelism of 1. At the moment this holds true for the CollectionInputFormat, IteratorInputFormat and the JDBCInputFormat.
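
To illustrate how such a marker interface can be used, here is a minimal plain-Java sketch. It does not use Flink's classes: the InputFormat and NonParallelInput types below are hypothetical stand-ins declared locally, and effectiveParallelism is an assumed helper name, not a Flink method. It only shows the general pattern of checking for a marker with instanceof and falling back to a parallelism of 1.

```java
public class MarkerInterfaceSketch {
    // Hypothetical stand-ins for illustration only, not Flink's actual types.
    interface InputFormat {}

    // Marker interface: it declares no methods, its presence is the signal.
    interface NonParallelInput {}

    // A collection-backed format that must run as a single task.
    static class CollectionLikeFormat implements InputFormat, NonParallelInput {}

    // A format without the marker, free to run in parallel.
    static class FileLikeFormat implements InputFormat {}

    // Assumed helper: picks the parallelism a source would run with.
    static int effectiveParallelism(InputFormat format, int requested) {
        return (format instanceof NonParallelInput) ? 1 : requested;
    }

    public static void main(String[] args) {
        System.out.println(effectiveParallelism(new CollectionLikeFormat(), 8));
        System.out.println(effectiveParallelism(new FileLikeFormat(), 8));
    }
}
```

With a requested parallelism of 8, the format carrying the marker is reduced to 1 while the other keeps 8.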


I hope this helps.
Cheers,
Till


On Tue, Feb 23, 2016 at 3:44 PM, Tim Conrad <conrad@math.fu-berlin.de> wrote:

    Hi Till (and others).

      

      Thank you very much for your helpful answer.

      

      On 23.02.2016 14:20, Till Rohrmann wrote:

    
    [...] In contrast, if you had a parallel data source
      which would consist of multiple source tasks, then these tasks
      would be independent and spread out across your cluster [...]
    

    Can you please send me a link to an example or to the respective
    Flink API doc, where I can see which is a parallel data source and
    how to create it with multiple source tasks?

    

    A simple Google search did not provide me with an answer (maybe I
    used the wrong key words, though...).

    

    

    Cheers

    Tim

    

    

    

    

    On 23.02.2016 14:20, Till Rohrmann
      wrote:

    
    
      
        
          Hi Tim,
          depending on how you
            create the DataSource<String> fileList,
            Flink will schedule the downstream operators differently. If
            you used the ExecutionEnvironment.fromCollection
            method, then it will create a DataSource
            with a CollectionInputFormat.
            This kind of DataSource
            will only be executed with a degree of parallelism of 1. The
            source will send its collection elements in a round-robin
            fashion to the downstream operators which are executed with
            a higher parallelism. So when Flink schedules the downstream
            operators, it will try to place them close to their inputs.
            Since all flat map operators have the single data source
            task as an input, they will be deployed on the same machine
            if possible.
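
          The round-robin hand-off described above can be pictured with a small plain-Java sketch. It is not Flink code: the class name RoundRobinSketch and the method distribute are made up for illustration, and the lists merely stand in for the channels to the downstream flat map tasks. It shows a single source handing element i to downstream task i modulo the downstream parallelism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RoundRobinSketch {
    // Hypothetical helper: one source distributes its elements in turn
    // over `downstreamParallelism` receiving tasks.
    static <T> List<List<T>> distribute(List<T> elements, int downstreamParallelism) {
        List<List<T>> channels = new ArrayList<>();
        for (int i = 0; i < downstreamParallelism; i++) {
            channels.add(new ArrayList<>());
        }
        for (int i = 0; i < elements.size(); i++) {
            // Element i goes to downstream task (i mod parallelism).
            channels.get(i % downstreamParallelism).add(elements.get(i));
        }
        return channels;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(distribute(files, 3)); // prints [[a, d], [b, e], [c]]
    }
}
```

          Every downstream task still has the same single source as its input, which is why the scheduler has no reason to spread them out.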

          In contrast, if you
            had a parallel data source which would consist of multiple
            source tasks, then these tasks would be independent and
            spread out across your cluster. In this case, every flat map
            task would have a single distinct source task as input. When
            the flat map tasks are deployed they would be deployed on
            the machine where their corresponding source is running.
            Since the source tasks are spread out across the cluster,
            the flat map tasks would be spread out as well.
          What you could do to
            mitigate your problem is to start the cluster with as many
            slots as your maximum degree of parallelism. That way,
            you’ll utilize all cluster resources.
          I hope this
            clarifies a bit why you observe that tasks tend to cluster
            on a single machine.
          Cheers,

            Till