mesos-user mailing list archives

From Benjamin Mahler <benjamin.mah...@gmail.com>
Subject Re: Framework still active after calling driver.stop
Date Tue, 19 May 2015 19:12:18 GMT
This is because stop() is processed asynchronously. It sends a message to
the scheduler process, which will eventually send a message on to the
master. That is why you've noticed that sleeping helps: it gives this time
to happen.
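
To make the ordering concrete, here is a minimal toy model in plain Python.
It is not the Mesos API: ToyDriver, its mailbox and its "unregistered" flag
are hypothetical names invented for illustration, and the delay is an
arbitrary stand-in for however long the real hand-off takes. It only shows
how a stop() call can return before the effect it triggers has happened.

```python
import queue
import threading
import time


class ToyDriver:
    """Toy stand-in for a driver. Hypothetical; NOT the Mesos API."""

    def __init__(self, delay=0.2):
        self._mailbox = queue.Queue()
        self._delay = delay
        # Set once the (simulated) master has processed the unregistration.
        self.unregistered = threading.Event()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def _run(self):
        # Background "scheduler process": waits for a message, then
        # forwards it after a simulated delay.
        message = self._mailbox.get()
        if message == "stop":
            time.sleep(self._delay)
            self.unregistered.set()

    def stop(self):
        # Only enqueues the request and returns immediately.
        self._mailbox.put("stop")


if __name__ == "__main__":
    driver = ToyDriver()
    driver.stop()
    print("right after stop():", driver.unregistered.is_set())
    driver.unregistered.wait(timeout=2.0)
    print("after waiting:", driver.unregistered.is_set())
```

In this toy, the flag is still unset when stop() returns and only flips once
the background thread has done its work, which mirrors the window in which
a stopped framework can still show up as active.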

There is no ticket specific to the scheduler driver for this, but the
executor-side one was discussed here:
https://issues.apache.org/jira/browse/MESOS-243

The scheduler case is different, so I've filed a ticket here:
https://issues.apache.org/jira/browse/MESOS-2751

Answering your questions:

(1) After driver.stop(false) completes (failover=false is the default), we
will eventually send an unregistration message. There is no guarantee on
how long this can take, but sleeping for a few seconds should cover it
fairly well for small-scale schedulers.

(2) If your MyMesosScheduler is able to handle being re-used, then yes, I
believe that should be ok. I assume you have some compelling reason not to
just create a new object.

(3) Watch the ticket I created for updates; unfortunately, you'll have to
rely on the sleep for now. But hopefully, now that you understand what is
happening, it feels less like voodoo. :)
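
If a bare sleep still feels too arbitrary, one option is to bound it: poll
some check until it passes or a deadline expires. The sketch below is a
generic helper, not part of the Mesos API. wait_until and the predicate you
pass in are hypothetical; how you actually check that the framework is gone
(for example, by asking the master which frameworks are active) is left to
you and is an assumption here. It uses time.monotonic(), so it assumes
Python 3.3 or later.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns true or the timeout expires.

    Hypothetical helper, not a Mesos call. Returns True if the predicate
    passed in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in predicate: passes on the third check. In real use this
    # would be your own (assumed) check that the framework is inactive.
    calls = {"n": 0}

    def framework_is_gone():
        calls["n"] += 1
        return calls["n"] >= 3

    print(wait_until(framework_is_gone, timeout=2.0, interval=0.05))
```

This is still a sleep underneath, so it does not add any guarantee; it only
lets the wait end as soon as the check passes and fail loudly when it does
not.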

Ben

On Thu, May 14, 2015 at 12:21 AM, Itamar Ostricher <itamar@yowza3d.com>
wrote:

> Hi,
> We have a production pipeline running a series of jobs, with each job
> creating a custom mesos framework to execute all tasks related to that job.
> Both scheduler and executor are written using the Python mesos API.
> Here's a snippet (modified for brevity) of the scheduler code:
>
> class MyMesosScheduler(mesos.Scheduler):
> <...>
>   def run_jobs(self, jobs):
>     for job in jobs:
>       framework = mesos_pb2.FrameworkInfo()
>       <...>
>       driver = MesosSchedulerDriver(self, framework, Flags.mesos_master)
>       driver.start()
>       for task in job.generate_tasks():
>         <...>
>       # wait for all tasks to complete
>       driver.stop()
> <...>
>
> This usually works just fine, but sometimes the pipeline gets "stuck" on
> the second framework, and I can see on the mesos dashboard that the first
> framework is still "active":
> [image: Inline image 1]
>
> I know driver.stop() for the first framework was called and has returned
> (from my logs, and from the fact that the following job started). I also
> see this in the console where the scheduler is running:
> I0514 07:05:16.819201 29486 sched.cpp:1286] Asked to stop the driver
>
> *If I add time.sleep(0.1) after driver.stop() the problem disappears!*
> I tried adding driver.join() after driver.stop(), but the behavior was the
> same (the join() returned immediately).
> I tried adding "del driver" after driver.stop(), but the behavior was the
> same.
>
> *So my questions are:*
> - What is promised to me by the mesos API on return from "driver.stop()" ?
> (I thought the promise is that the framework successfully stopped,
> including stopping all executors)
> - Is it safe to "recycle" the same instance of MyMesosScheduler for
> multiple (consecutive, not overlapping) frameworks? (note that the driver
> object is brand new for every framework)
> - Any thoughts on the problem I'm describing, and potential solutions that
> are not based on a voodoo sleep?
>
> Thanks!
> - Itamar.
>
