Return-Path: X-Original-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Delivered-To: apmail-incubator-mesos-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D83A6108A8 for ; Mon, 6 May 2013 20:20:16 +0000 (UTC) Received: (qmail 52816 invoked by uid 500); 6 May 2013 20:20:16 -0000 Delivered-To: apmail-incubator-mesos-dev-archive@incubator.apache.org Received: (qmail 52777 invoked by uid 500); 6 May 2013 20:20:16 -0000 Mailing-List: contact mesos-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: mesos-dev@incubator.apache.org Delivered-To: mailing list mesos-dev@incubator.apache.org Received: (qmail 52586 invoked by uid 99); 6 May 2013 20:20:16 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 06 May 2013 20:20:16 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id 455FF1C95E0; Mon, 6 May 2013 20:20:12 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============3137888137650631014==" MIME-Version: 1.0 Subject: Re: Review Request: Terminate executors that aren't needed. From: "Brian Wickman" To: "Vinod Kone" , "Ben Mahler" , "Brian Wickman" , "mesos" , "Brenden Matthews" Date: Mon, 06 May 2013 20:20:12 -0000 Message-ID: <20130506202012.5514.90649@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org Auto-Submitted: auto-generated Sender: "Brian Wickman" X-ReviewGroup: mesos X-ReviewRequest-URL: https://reviews.apache.org/r/10932/ X-Sender: "Brian Wickman" References: <20130503193720.17260.92709@reviews.apache.org> In-Reply-To: <20130503193720.17260.92709@reviews.apache.org> Reply-To: "Brian Wickman" --===============3137888137650631014== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable > On May 3, 2013, 7:37 p.m., Vinod Kone wrote: > > src/slave/slave.cpp, lines 1085-1089 > > > > > > this seems like the wrong thing to do. an executor can run more tha= n one task. why do you want to kill the executor if it could get more tasks? > = > Brenden Matthews wrote: > I think this is a bug. > = > I've had many cases where the executor launches, starts a task, and t= he task is killed before it has finished launching. This results in the ta= sk continuing to run indefinitely or until the mesos slave process is resta= rted. > = > Vinod Kone wrote: > I don't think I follow the sequence of events here. Is it as follows? > = > --> Slave launches an executor > --> Before the executor registers with the slave, the framework asks = to kill a task (do you know why?) > --> When the executor registers it doesn't get any task from the slave > --> The executor is running without any task. > = > I don't understand what do you mean by "task continuing to run indefi= nitely". Do you mean "the executor runs indefinitely"? If its the latter, i= t seems the right semantics for a general purpose executor. Am I missing so= mething? > = > Brenden Matthews wrote: > I'm sorry, I realize that wasn't very clear. I went digging for logs= but I can't find an example (it seems to have been all rotated out). > = > And yes, that sounds correct. > = > I'm not actually sure what the cause of this is. The Hadoop schedule= r will occasionally kill tasks, so it could be that (but I haven't scoured = the logs to determine the cause). > = > Ben Mahler wrote: > I agree with Brenden here that this is unexpected. Currently, all exe= cutors have to handle the case where they start and _never_ receive a launc= hTask. That seems broken to me, since the expectation is that we've launche= d the executor in order to launch a task in the first place. > = > After talking with Vinod I think there are two ways to "fix" this: > = > 1) In the ExecutorDriver, create a timeout to commit suicide when no = launch task is received within, say, 10 seconds of registering with the sla= ve. > = > 2) Send the launch task to the executor anyway, immediately followed = by the kill task request. This is tricky. > = > 3) Leave as is and have the MesosExecutor for Hadoop commit suicide i= f no task is received within 10 seconds of registration. Again, this only f= ixes the issue for the Hadoop executor. > = > So 1 seems to be the best option here. Other thoughts Vinod? Brenden,= do you want to take that on, or file an issue? > = > Brenden Matthews wrote: > I wrote a quick patch, if I understand your proposal correctly. > = > Vinod Kone wrote: > Sorry for the back and forth Brenden. After talking with Ben Mahler a= nd looking at your diffs, I would actually prefer your first solution becau= se its very simple. = > = > I didn't realize that you were only killing the executor if it hasn't= registered yet. Since we only launch an executor if there is a task for it= , I think its actually fine to send a kill when queued and launched tasks a= re empty and the executor hasn't registered yet. > = > that said, we need some re-formatting on your 1st diff :). Our format= ting rule is to put all args on the second line if that fits within 80, oth= erwise each arg on a different line. > = > dispatch( > isolator, &Isolator::killExecutor, framework->id, executor->id); > = > @benh do you have any comments? > = > #1 is precisely what twitter's executor does: in executor's main(): # Create driver stub driver =3D mesos.MesosExecutorDriver(thermos_executor) # This is an ephemeral executor -- shutdown if we receive no tasks within= a certain # time period ThermosExecutorTimer(thermos_executor, driver).start() # Start executor. driver.run() ThermosExecutorTimer: class ThermosExecutorTimer(ExceptionalThread): EXECUTOR_TIMEOUT =3D Amount(10, Time.SECONDS) def __init__(self, executor, driver): self._executor =3D executor self._driver =3D driver super(ThermosExecutorTimer, self).__init__() self.daemon =3D True def run(self): self._executor.launched.wait(self.EXECUTOR_TIMEOUT.as_(Time.SECONDS)) if not self._executor.launched.is_set(): self._executor.log('Executor timing out on lack of launchTask.') self._driver.stop() - Brian ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/10932/#review20136 ----------------------------------------------------------- On May 6, 2013, 7:07 p.m., Brenden Matthews wrote: > = > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/10932/ > ----------------------------------------------------------- > = > (Updated May 6, 2013, 7:07 p.m.) > = > = > Review request for mesos. > = > = > Description > ------- > = > From 607072595b91993e2d47251ee841fb3dc5d84e05 Mon Sep 17 00:00:00 2001 > From: Brenden Matthews > Date: Fri, 3 May 2013 09:47:22 -0700 > Subject: [PATCH 8/9] Terminate executors that aren't needed. > = > If we launch an executor and then kill the task immediately after, make > sure we also terminate the executor when there are no other tasks. > --- > src/slave/slave.cpp | 48 +++++++++++++++++++++++++++------------------= --- > 1 file changed, 27 insertions(+), 21 deletions(-) > = > = > Diffs > ----- > = > include/mesos/executor.hpp 9b25834 = > src/exec/exec.cpp 1f022ca = > = > Diff: https://reviews.apache.org/r/10932/diff/ > = > = > Testing > ------- > = > Used in production at airbnb. > = > = > Thanks, > = > Brenden Matthews > = > --===============3137888137650631014==--