flink-dev mailing list archives

From CPC <acha...@gmail.com>
Subject Re: Data locality and scheduler
Date Tue, 26 Apr 2016 16:17:12 GMT
Hi

But can't this behaviour cause a lot of network activity? Is there any
roadmap or plan to change it?
On Apr 26, 2016 7:06 PM, "Fabian Hueske" <fhueske@gmail.com> wrote:

> Hi,
>
> Flink starts four tasks and then lazily assigns input splits to these tasks
> with locality preference. So each task may consume more than one split.
> This is different from Hadoop MapReduce or Spark which schedule a new task
> for each input split.
> In your case, the four tasks would be scheduled to four of the 40 machines
> and most of the splits will be remotely read.
>
> Best, Fabian
>
>
> 2016-04-26 16:59 GMT+02:00 CPC <achalil@gmail.com>:
>
> > Hi,
> >
> > I looked at some of the scheduler documentation but could not find an
> > answer to my question. Suppose I have a big file on a 40-node Hadoop
> > cluster, and since it is a big file, every node has at least one chunk
> > of it. If I write a Flink job that filters the file, and the job has a
> > parallelism of 4 (less than 40), how does data locality work? Do some
> > tasks read chunks from remote nodes? Or does the scheduler keep the
> > maximum parallelism at 4 but schedule tasks on every node?
> >
> > Regards
> >
>
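
For illustration, the behaviour Fabian describes in the quoted reply can be simulated with a small Python sketch. This is not Flink's scheduler or split-assigner code; the function name `assign_splits`, the host names, and the assumption of exactly one unreplicated split per node are all hypothetical, chosen to mirror the 40-node, parallelism-4 scenario. Each task asks for its next split lazily and receives a local one if any remains, otherwise a remote one.

```python
def assign_splits(split_hosts, task_hosts):
    """Simulate lazy split assignment with locality preference.

    split_hosts: dict mapping split id -> set of hosts storing that split
    task_hosts:  list of distinct hosts, one per running task
    Returns a dict mapping task host -> list of (split id, is_local).
    """
    unassigned = dict(split_hosts)
    result = {host: [] for host in task_hosts}
    while unassigned:
        # Tasks take turns requesting their next split.
        for host in task_hosts:
            if not unassigned:
                break
            # Prefer a split stored on the requesting task's host.
            local = next(
                (s for s, hosts in unassigned.items() if host in hosts), None
            )
            # Otherwise fall back to any remaining split (a remote read).
            chosen = local if local is not None else next(iter(unassigned))
            is_local = host in unassigned.pop(chosen)
            result[host].append((chosen, is_local))
    return result


# Hypothetical cluster: 40 nodes, one split per node, no replication.
hosts = [f"node{i}" for i in range(40)]
splits = {i: {hosts[i]} for i in range(40)}

# Parallelism 4: the tasks land on four of the 40 machines.
assignment = assign_splits(splits, hosts[:4])

local_count = sum(loc for reads in assignment.values() for _, loc in reads)
remote_count = sum(not loc for reads in assignment.values() for _, loc in reads)
print(local_count, remote_count)
```

Under these assumptions each task reads ten splits, only one of which is local. With HDFS replication a split would list several hosts, so the local share would be somewhat higher, but most reads would still be remote.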
