mesos-issues mailing list archives

From "Ilya Pronin (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (MESOS-1648) Add a --pidfile option to master and agent binaries.
Date Fri, 25 Nov 2016 15:41:58 GMT

     [ https://issues.apache.org/jira/browse/MESOS-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ilya Pronin reassigned MESOS-1648:
----------------------------------

    Assignee: Ilya Pronin

> Add a --pidfile option to master and agent binaries.
> ----------------------------------------------------
>
>                 Key: MESOS-1648
>                 URL: https://issues.apache.org/jira/browse/MESOS-1648
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, master
>            Reporter: Tobias Weingartner
>            Assignee: Ilya Pronin
>              Labels: newbie, twitter
>
> Right now we use a number of wrapper scripts to try to maintain a {{/var/run/mesos/mesos-slave.pid}}
so that we can monitor the process.  This has proven to be somewhat fragile due to
the lack of locking and the possibility of races and stale data.
> By adding a {{--pidfile}} option, we can obtain a lock on the file to prevent multiple binaries
from starting, and to enable the tooling to validate that the lock is held before doing any
signaling. We can also do a best-effort unlink in the signal handler upon termination:
> {code}
> // Get exclusive access to the file.
> fd = open(O_CREAT ...)
> flock(fd, LOCK_EX | LOCK_NB)
> if not locked, abort
> ftruncate(fd, 0)
> // Write the pid.
> write(fd, "<pid>")
> // Inside signal handler..
> unlink(pidfile)
> {code}
> Digging around, looks like the open, ftruncate, write pattern is pretty common:
> http://man7.org/tlpi/code/online/diff/filelock/create_pid_file.c.html
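The open, flock, ftruncate, write sequence above could be sketched in Python roughly as follows. This is only an illustration of the pattern, not the Mesos implementation: the names `acquire_pidfile` and `release_pidfile` are hypothetical, and the sketch assumes the lock is requested with `LOCK_NB` so that a second instance fails immediately instead of waiting.

```python
import fcntl
import os


def acquire_pidfile(path):
    """Create and lock `path`, write our pid, and return the open fd.

    Raises BlockingIOError if another process already holds the lock.
    The fd must stay open for the life of the process, because closing
    it releases the lock.
    """
    # Open (or create) the file without truncating it: the contents
    # belong to whoever currently holds the lock.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock fail right away instead of blocking.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    # Only the lock holder discards the previous contents.
    os.ftruncate(fd, 0)
    # Write the pid.
    os.write(fd, (str(os.getpid()) + "\n").encode())
    return fd


def release_pidfile(fd, path):
    """Best-effort cleanup, e.g. called on the termination path."""
    try:
        os.unlink(path)
    except OSError:
        pass  # Best effort only: a missing file is not an error here.
    os.close(fd)
```

A process would call `acquire_pidfile` once at startup and exit if it raises. `release_pidfile` stands in for the unlink step of the pseudocode; how it gets wired to a signal handler is left out of this sketch.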
> The tooling around it could verify that the file is locked by the pid inside it, before taking
any action (like signaling):
> *Case 1*: If the file does not exist or is not locked, then assume nothing is running.
It's possible for something to be running and about to grab the lock, but we'll eventually
read it correctly and converge on a single instance started correctly.
> *Case 2*: If the file is locked, and the pid doesn't match, then assume it is running
but not as the pid in the file (.. yet). Treat this the same as (1), assume it's not running,
and the next attempts to start will eventually converge on a single instance running.
> *Case 3*: If the file is locked, and the pid matches the locker process, then assume
it is running as that pid. Note that it's still possible that in between matching the pid
and taking an action (e.g. kill), the pid may become stale, but the recycling pattern of pids
makes it unlikely to be re-used unless there is a large delay.
> It seems like some tools already do this signal wrapping (note the comment about fcntl
and note the race from (3) in the BUGS section):
> http://manpages.ubuntu.com/manpages/natty/man8/ovs-kill.8.html
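The three cases listed above could be approximated by a tool along these lines. This is a hedged sketch with a hypothetical helper name, `running_pid`, and it makes one simplifying assumption: `flock` cannot report which process holds a lock, so the sketch trusts the pid recorded in a locked file rather than comparing it against the actual lock owner. Confirming the owner would need `fcntl`-style record locks, which is not shown here.

```python
import fcntl
import os


def running_pid(path):
    """Return the pid recorded in a locked pidfile, else None."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return None  # Case 1: the file does not exist.
    try:
        try:
            # Probe the lock: a shared lock conflicts with an
            # exclusive lock held through another open of the file.
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except BlockingIOError:
            pass  # Locked by someone else: go on and read the pid.
        else:
            fcntl.flock(fd, fcntl.LOCK_UN)
            return None  # Case 1: the file exists but is not locked.
        data = os.read(fd, 32).strip()
    finally:
        os.close(fd)
    if not data.isdigit():
        # Case 2 (approximation): locked, but no usable pid yet.
        return None
    # Case 3 (assumed): treat the recorded pid as the lock holder.
    return int(data)
```

A caller would signal the returned pid only when the result is not `None`, which leaves the race described in case 3 in place. Note also that the probe briefly holds a shared lock when the file is unlocked, so an instance starting at that exact moment could fail to acquire its exclusive lock and would need to be retried.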



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
