Subject: Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure
From: "Timothy Chen"
To: "Vinod Kone", "Ben Mahler", "Timothy Chen"
Cc: "Jay Buffington", "mesos", "Jie Yu"
Date: Wed, 22 Apr 2015 00:38:02 -0000
Message-ID: <20150422003802.2947.8183@reviews.apache.org>
In-Reply-To: <20150421232528.2947.55966@reviews.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/33249/
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="===============5313354592069977007=="

--===============5313354592069977007==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

> On April 21, 2015, 11:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> >
> >     Instead of doing it your way, can we just make sure `containerizer->wait` here returns a failure (or a Termination with some reason) when `containerizer->launch` fails? That way, `executorTerminated` will properly send status updates to the slave (TASK_LOST/TASK_FAILED).
> >
> >     Or am I missing something?
>
> Jie Yu wrote:
>     OK, I think I got confused by the ticket. There are actually two problems here. The problem I am referring to is that, if the containerizer launch fails, we don't send a status update to the scheduler until the executor reregistration timeout happens. Since someone might use a very large timeout value for the docker containerizer, the slave should ideally send a status update to the scheduler right after the containerizer launch fails.
>
>     After chatting with Jay, the problem you guys are referring to is that the scheduler cannot distinguish between the case where the task has failed and the case where the configuration of a task is not correct, because in both cases the scheduler will receive a TASK_FAILED/TASK_LOST.
>
> Jie Yu wrote:
>     To address the first problem, I think the simplest way is to add a containerizer->destroy(..) in executorLaunched when containerizer->launch fails. That way, it's going to trigger containerizer->wait and thus send a status update to the scheduler.
>
> Jie Yu wrote:
>     Regarding the second problem, IMO, we should include a reason field in Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let sendExecutorTerminatedStatusUpdate propagate the termination reason to the scheduler.

The reason field sounds good, and I think what you proposed makes sense. In the docker containerizer, at least, we also need to make sure the termination message is set correctly, since it currently doesn't contain all the error information that we pass back to the launch future.

- Timothy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
-----------------------------------------------------------


On April 21, 2015, 5:14 p.m., Jay Buffington wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> -----------------------------------------------------------
>
> (Updated April 21, 2015, 5:14 p.m.)
>
>
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
>
>
> Bugs: MESOS-2020
>     https://issues.apache.org/jira/browse/MESOS-2020
>
>
> Repository: mesos
>
>
> Description
> -------
>
> When mesos is unable to launch the containerizer, the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
>
> Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
>
> Fixes MESOS-2020
>
>
> Diffs
> -----
>
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188
>
> Diff: https://reviews.apache.org/r/33249/diff/
>
>
> Testing
> -------
>
> I added a test case to slave_tests.cpp. I also tried this with Aurora: I supplied a bogus docker image URL and saw the "docker pull" failure stderr message in Aurora's web UI.
>
>
> Thanks,
>
> Jay Buffington
>
>

--===============5313354592069977007==--
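
The flow discussed in this thread (destroy the container when launch fails, so that the pending wait fires and a status update carrying a reason and the launch error reaches the scheduler) can be illustrated with a toy model. This is a minimal sketch, not Mesos code: ToyContainerizer, Termination, StatusUpdate, Reason, and launchExecutor are hypothetical stand-ins invented for illustration, the real containerizer interface is asynchronous and future-based, and the sketch assumes C++17.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT the real Mesos types.
enum class TaskState { TASK_FAILED, TASK_LOST };
enum class Reason { NONE, CONTAINERIZER_LAUNCH_FAILED };

// What a terminated container reports: a reason plus a human-readable message.
struct Termination {
  Reason reason = Reason::NONE;
  std::string message;
};

// What the scheduler would eventually receive.
struct StatusUpdate {
  TaskState state;
  Reason reason;
  std::string message;
};

class ToyContainerizer {
public:
  // Pass an error string to simulate a failing launch, std::nullopt otherwise.
  explicit ToyContainerizer(std::optional<std::string> launchError)
    : launchError_(std::move(launchError)) {}

  // Returns the launch error, or std::nullopt when the launch succeeds.
  std::optional<std::string> launch() const { return launchError_; }

  // Registers a callback that runs when the container terminates.
  void wait(std::function<void(const Termination&)> callback) {
    onTermination_ = std::move(callback);
  }

  // Destroys the container and fires the pending wait callback at most once.
  void destroy(const Termination& termination) {
    if (onTermination_) {
      auto callback = std::move(onTermination_);
      onTermination_ = nullptr;
      callback(termination);
    }
  }

private:
  std::optional<std::string> launchError_;
  std::function<void(const Termination&)> onTermination_;
};

// Toy version of the launch path: when launch fails, destroy the container
// right away so the wait callback sends a status update immediately instead
// of after a timeout. `sent` is owned by the caller and must outlive
// `containerizer`, because the wait callback keeps a reference to it.
void launchExecutor(ToyContainerizer& containerizer,
                    std::vector<StatusUpdate>& sent) {
  containerizer.wait([&sent](const Termination& t) {
    // Propagate the termination reason and message to the scheduler.
    sent.push_back({TaskState::TASK_FAILED, t.reason, t.message});
  });

  if (auto error = containerizer.launch()) {
    containerizer.destroy({Reason::CONTAINERIZER_LAUNCH_FAILED, *error});
  }
}
```

In this model, constructing a ToyContainerizer with an error string and calling launchExecutor yields exactly one update whose message is that error string, while a successful launch yields none. Putting the error text in the Termination message mirrors the point above that the termination message needs to carry the information returned by the failed launch.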