Return-Path: X-Original-To: apmail-mesos-dev-archive@www.apache.org Delivered-To: apmail-mesos-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E3F1710D71 for ; Thu, 29 May 2014 02:37:01 +0000 (UTC) Received: (qmail 62575 invoked by uid 500); 29 May 2014 02:37:01 -0000 Delivered-To: apmail-mesos-dev-archive@mesos.apache.org Received: (qmail 62517 invoked by uid 500); 29 May 2014 02:37:01 -0000 Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mesos.apache.org Delivered-To: mailing list dev@mesos.apache.org Received: (qmail 62493 invoked by uid 99); 29 May 2014 02:37:01 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 29 May 2014 02:37:01 +0000 Date: Thu, 29 May 2014 02:37:01 +0000 (UTC) From: "Till Toenshoff (JIRA)" To: dev@mesos.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (MESOS-1431) io::splice usage needs special care - especially in connection with process::subprocess MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/MESOS-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011987#comment-14011987 ] Till Toenshoff commented on MESOS-1431: --------------------------------------- https://reviews.apache.org/r/22001/ (MesosContainerizer) https://reviews.apache.org/r/21966/ (ExternalContainerizer) I shouldnt have bundled the original cloexec bug of the EC with the splice fix on the EC in the first place. Now things got confusing and I have to apologize for that. > io::splice usage needs special care - especially in connection with process::subprocess > --------------------------------------------------------------------------------------- > > Key: MESOS-1431 > URL: https://issues.apache.org/jira/browse/MESOS-1431 > Project: Mesos > Issue Type: Bug > Affects Versions: 0.19.0 > Reporter: Till Toenshoff > Assignee: Till Toenshoff > Labels: bug, splice, subprocess > > When using {{process::subproces}} in connection with {{io::splice}}, make sure you work with extra care. > Consider this piece of code (ripped from EC), which looks rather unspectacular - very straightforward use of subprocess and splice. > {noformat} > // Fork exec of external process. > Try external = process::subprocess( > execute, > environment); > if (external.isError()) { > return Error("Failed to execute external containerizer: " + > external.error()); > } > // Set stderr into non-blocking mode. > Try nonblock = os::nonblock(external.get().err()); > if (nonblock.isError()) { > return Error("Failed to accept nonblock: " + nonblock.error()); > } > // Redirect output (stderr) from the subprocess to a log > // file. > Try err = os::open( > sandbox.isSome() ? path::join(sandbox.get().directory, "stderr") > : "/dev/null", > O_WRONLY | O_CREAT | O_APPEND | O_NONBLOCK, > S_IRUSR | S_IWUSR | S_IRGRP | S_IRWXO); > if (err.isError()) { > return Error( > "Failed to redirect stderr: Failed to open: " + > err.error()); > } > Try cloexec = os::cloexec(err.get()); > if (cloexec.isError()) { > os::close(err.get()); > return Error( > "Failed to redirect stderr: Failed to set close-on-exec: " + > cloexec.error()); > } > io::splice(external.get().err(), err.get()) > .onAny(&os::close, err.get())); > {noformat} > The above code is buggy as subprocess will close {{external.get().err()}} once the child got reaped. So the FD we are reading (splicing) from potentially gets closed before the splicer was able to get the {{EOF}}. That in turn will cause libev to continue polling that FD (remember, it never reached a final state) and that is where havoc breaks lose as we will now see data getting lost that is sent from reused FDs. > > If you do not {{dup}} the subprocess returned FDs before using an {{io::splice}} on it, thus not taking full ownership of the FD, you will end up with fancy problems. > So a fix would be replacing the last two lines by this: > {noformat} > // Duplicate 'from' so that we're in control of it's lifetime, > // exceeding the lifetime of the reaped subprocess. It is needed > // to make sure the splicer had a chance to read and process the > // EOF. > int fd = dup(external.get().err()); > if (fd == -1) { > os::close(err.get()); > return ErrnoError("Failed to duplicate stderr file descriptor"); > } > io::splice(fd, err.get()) > .onAny(bind(&_invoke, fd, err.get())); > {noformat} > And also adding an onAny triggered function (_invoke in this case) that is closing both, the dup'ed FD and the log-file FD. > The bad news are that the EC is not the only implementation that contains such buggy code. There is at least one more such bug found within mesos_containerizer.cpp:680 and we should now all check for more such use-cases and fix them before releasing. -- This message was sent by Atlassian JIRA (v6.2#6252)