From: Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr>
Subject: Re: Not enough free slots to run the job
Date: Mon, 21 Mar 2016 15:30:24 +0100
To: user@flink.apache.org

Hi Robert,

I am not sure I understand, so please confirm whether I read your suggestions correctly:

- use fewer slots than the available slot capacity, to avoid issues such as a TaskManager not contributing its slots because of problems registering the TM; (this means I will lose some performance by not using all of the available capacity)

- if a job fails because it loses a TaskManager (and its slots), the job will not restart even if free slots are available; (so for this case the 'spare slots' will not be of help, right? Losing a TM means the job fails, with no recovery)
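To make sure I read the first point correctly, here is roughly what I assume the suggested setup looks like. This is only a minimal sketch on my side: the cluster layout (4 TaskManagers x 5 slots = 20 slots), the retry count and the job body are invented just to make the numbers concrete.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SpareSlotsSketch {

        public static void main(String[] args) throws Exception {
            // Assumed cluster: 4 TaskManagers, each started with
            // taskmanager.numberOfTaskSlots: 5 in flink-conf.yaml -> 20 slots in total.
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Run the job with p=15, i.e. below the 20 available slots,
            // leaving 5 slots unused as spares (the performance cost I mention above).
            env.setParallelism(15);

            // My assumption: automatic restarts also require retries to be enabled,
            // otherwise the job fails permanently on the first failure.
            env.setNumberOfExecutionRetries(3);

            // Placeholder pipeline, only here so the sketch is runnable.
            env.fromElements(1, 2, 3)
               .rebalance()
               .print();

            env.execute("spare-slots sketch");
        }
    }

If this is not what you had in mind, please correct me.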
Thanks!

Best,
Ovidiu

> On 21 Mar 2016, at 14:15, Robert Metzger <rmetzger@apache.org> wrote:
>
> Hi Ovidiu,
>
> right now the scheduler in Flink will not use more slots than requested.
> To avoid issues on recovery, we usually recommend that users keep some spare slots (run the job with p=15 on a cluster with slots=20). I agree that it would make sense to add a flag which allows a job to grab more slots if they are available. The problem with that, however, is that jobs currently cannot change their parallelism. So if a job fails, it cannot downscale to restart on the remaining slots.
> That's why the spare-slots approach is currently the only way to go.
>
> Regards,
> Robert
>
> On Fri, Mar 18, 2016 at 1:30 PM, Ovidiu-Cristian MARCU <ovidiu-cristian.marcu@inria.fr> wrote:
> Hi,
>
> For the situation where a program specifies a maximum parallelism (so it is supposed to use all available task slots), it can happen that one of the TaskManagers is not registered, for various reasons. In this case the job will fail because there are not enough free slots to run it.
>
> To me this means the scheduler is limited to statically assigning tasks to the task slots the job is configured with.
>
> Instead, I would like to be able to specify a minimum parallelism for a job, together with the possibility of dynamically using more task slots if additional ones are available.
> Another use case: if we lose one node (and therefore some task slots) during the execution of a job, the job should recover and continue its execution as long as the minimum parallelism is still satisfied, instead of simply failing.
>
> Is it possible to make such changes?
>
> Best,
> Ovidiu