mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Till Toenshoff (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-1431) io::splice usage needs special care - especially in connection with process::subprocess
Date Thu, 29 May 2014 00:20:01 GMT

    [ https://issues.apache.org/jira/browse/MESOS-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011864#comment-14011864
] 

Till Toenshoff commented on MESOS-1431:
---------------------------------------

Addendum; the better fix would be using the freshly proposed https://reviews.apache.org/r/21998
instead of io::splice as that one takes care of the FD ownership.

> io::splice usage needs special care - especially in connection with process::subprocess
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-1431
>                 URL: https://issues.apache.org/jira/browse/MESOS-1431
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Till Toenshoff
>              Labels: bug, splice, subprocess
>
> When using {{process::subproces}} in connection with {{io::splice}}, make sure you work
with extra care. 
> Consider this piece of code (ripped from EC), which looks rather unspectacular - very
straightforward use of subprocess and splice.
> {noformat}
>   // Fork exec of external process.
>   Try<Subprocess> external = process::subprocess(
>       execute,
>       environment);
>   if (external.isError()) {
>     return Error("Failed to execute external containerizer: " +
>                  external.error());
>   }
>   // Set stderr into non-blocking mode.
>   Try<Nothing> nonblock = os::nonblock(external.get().err());
>   if (nonblock.isError()) {
>     return Error("Failed to accept nonblock: " + nonblock.error());
>   }
>   // Redirect output (stderr) from the subprocess to a log
>   // file.
>   Try<int> err = os::open(
>       sandbox.isSome() ? path::join(sandbox.get().directory, "stderr")
>                        : "/dev/null",
>       O_WRONLY | O_CREAT | O_APPEND | O_NONBLOCK,
>       S_IRUSR | S_IWUSR | S_IRGRP | S_IRWXO);
>   if (err.isError()) {
>     return Error(
>         "Failed to redirect stderr: Failed to open: " +
>         err.error());
>   }
>   Try<Nothing> cloexec = os::cloexec(err.get());
>   if (cloexec.isError()) {
>     os::close(err.get());
>     return Error(
>         "Failed to redirect stderr: Failed to set close-on-exec: " +
>         cloexec.error());
>   }
>   io::splice(external.get().err(), err.get())
>     .onAny(&os::close, err.get()));
> {noformat}
> The above code is buggy as subprocess will close {{external.get().err()}} once the child
got reaped. So the FD we are reading (splicing) from potentially gets closed before the splicer
was able to get the {{EOF}}. That in turn will cause libev to continue polling that FD (remember,
it never reached a final state) and that is where havoc breaks lose as we will now see data
getting lost that is sent from reused FDs.
>  
> If you do not {{dup}} the subprocess returned FDs before using an {{io::splice}} on it,
thus not taking full ownership of the FD, you will end up with fancy problems.
> So a fix would be replacing the last two lines by this:
> {noformat}
>   // Duplicate 'from' so that we're in control of it's lifetime,
>   // exceeding the lifetime of the reaped subprocess. It is needed
>   // to make sure the splicer had a chance to read and process the
>   // EOF.
>   int fd = dup(external.get().err());
>   if (fd == -1) {
>     os::close(err.get());
>     return ErrnoError("Failed to duplicate stderr file descriptor");
>   }
>   io::splice(fd, err.get())
>     .onAny(bind(&_invoke, fd, err.get()));
> {noformat}
> And also adding an onAny triggered function (_invoke in this case) that is closing both,
the dup'ed FD and the log-file FD.
> The bad news are that the EC is not the only implementation that contains such buggy
code. There is at least one more such bug found within mesos_containerizer.cpp:680 and we
should now all check for more such use-cases and fix them before releasing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message