flink-user mailing list archives

From Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
Subject Re: Not enough free slots to run the job
Date Mon, 21 Mar 2016 14:30:24 GMT
Hi Robert,

I am not sure I understand, so please confirm whether I have your
suggestions right:
- Use fewer slots than the available slot capacity, to avoid issues such
as a TaskManager not offering its slots because of some problem
registering the TM;
(this means I will lose some performance by not using all the available
capacity; see the sketch after this list)
- If a job fails because it loses a TaskManager (and its slots), the job
will not restart even if free slots are available to use.
(in this case the ‘spare slots’ will not be of help, right? Losing a TM
means the job will fail, no recovery)
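
For reference, here is a minimal sketch of the spare-slots setup as I
understand it, assuming a standalone cluster offering 20 task slots in
total; the class name and the numbers are illustrative, not from your
message:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SpareSlotsExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Assumption: the cluster offers 20 task slots. Requesting only 15
        // leaves 5 spares, so if one TaskManager (and its slots) is lost,
        // the job can still restart at the same parallelism on the
        // remaining slots.
        env.setParallelism(15);

        DataSet<Integer> doubled = env.fromElements(1, 2, 3, 4, 5)
            .map(new MapFunction<Integer, Integer>() {
                @Override
                public Integer map(Integer value) {
                    return value * 2;
                }
            });

        // print() triggers execution for the DataSet API.
        doubled.print();
    }
}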

Thanks!

Best,
Ovidiu


> On 21 Mar 2016, at 14:15, Robert Metzger <rmetzger@apache.org> wrote:
> 
> Hi Ovidiu,
> 
> right now the scheduler in Flink will not use more slots than requested.
> To avoid issues on recovery, we usually recommend that users keep some
> spare slots (run the job with p=15 on a cluster with slots=20). I agree
> that it would make sense to add a flag which allows a job to grab more
> slots if they are available. The problem with that, however, is that
> jobs currently cannot change their parallelism. So if a job fails, it
> cannot downscale to restart on the remaining slots.
> That's why the spare-slots approach is currently the only way to go.
> 
> Regards,
> Robert
> 
> On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr> wrote:
> Hi,
> 
> For the situation where a program specifies a maximum parallelism (so it
> is supposed to use all available task slots), it is possible that one of
> the task managers is not registered, for various reasons.
> In this case the job will fail with "not enough free slots to run the job".
> 
> To me this means the scheduler is limited to statically assigning tasks
> to the task slots the job is configured with.
> 
> Instead, I would like to be able to specify a minimum parallelism for a
> job, together with the possibility of dynamically using more task slots
> if additional ones are available.
> Another use case: if during the execution of a job we lose one node, and
> with it some task slots, the job should recover and continue its
> execution instead of just failing, as long as the minimum parallelism is
> still ensured.
> 
> Is it possible to make such changes?
> 
> Best,
> Ovidiu
> 

