Return-Path: X-Original-To: apmail-mesos-dev-archive@www.apache.org Delivered-To: apmail-mesos-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 7C5DF177B8 for ; Tue, 31 Mar 2015 18:22:39 +0000 (UTC) Received: (qmail 98027 invoked by uid 500); 31 Mar 2015 18:21:40 -0000 Delivered-To: apmail-mesos-dev-archive@mesos.apache.org Received: (qmail 97966 invoked by uid 500); 31 Mar 2015 18:21:40 -0000 Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mesos.apache.org Delivered-To: mailing list dev@mesos.apache.org Received: (qmail 97945 invoked by uid 99); 31 Mar 2015 18:21:39 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 31 Mar 2015 18:21:39 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of niklas@mesosphere.io designates 209.85.212.200 as permitted sender) Received: from [209.85.212.200] (HELO mail-wi0-f200.google.com) (209.85.212.200) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 31 Mar 2015 18:21:35 +0000 Received: by wixm2 with SMTP id m2so6336014wix.0 for ; Tue, 31 Mar 2015 11:21:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mesosphere.io; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc:content-type; bh=B4jCF47UJIHgHPQqznyqnPp9rZp1fSD0Ylp0YzzIlNQ=; b=hM8m+RAX15WkJSjeJf+IKutK/uIZ0qrkn/owVuDUXGFtfEWPjkwVFP5skwv0IkD3PB AxrCHnDojeasb3EL6r79zpQ/1/AHAU91xWOkHoTJDdSFq7AzCushfaHJ4vT2VTm7wkHq bZEs2N1z9EJLbhngbKzZhO/24k/lF40tc0ue4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc:content-type; bh=B4jCF47UJIHgHPQqznyqnPp9rZp1fSD0Ylp0YzzIlNQ=; b=NqbnpM/1ZarzR0VW2JN9arPKbPixV6wDH0VGvuSL/qDKD1OP/eXfoTORkcT0OtRKup dpU2dp/cgLOLhyIQRk+FWIYN3fl4EHmXTISlunYlEnXNoLz4WgShLyLjfCUjYwrvIXYR XqegytO40XKswMzG6TCiptJF/sgfU2q+e0ThhCvZ2fEMWGqms2S2srHjPAesyTNW4vhF I1bwliJmPBYOw/A7TP44UvXW15uZkfxokR/eLBBUSR/7i0TkrHIlEAdZ47e5fudNF5DU wmfBdDI+ASjdUuBdLEJj2dU/wVwE0FzoKh7zlR4TqTs0E5HF9MwMg60NbHywEvTYWmxq f3DQ== X-Gm-Message-State: ALoCoQk4PVVerGKiPvC2ChaYEUJ/LdYQpM5MWFFY2DorEU6vPO4SDL65bOjgFEFqds/56PvOpHKh X-Received: by 10.180.104.7 with SMTP id ga7mr7462830wib.37.1427826071679; Tue, 31 Mar 2015 11:21:11 -0700 (PDT) MIME-Version: 1.0 Received: by 10.27.135.85 with HTTP; Tue, 31 Mar 2015 11:20:51 -0700 (PDT) In-Reply-To: References: From: Niklas Nielsen Date: Tue, 31 Mar 2015 11:20:51 -0700 Message-ID: Subject: Re: bug in 0.22.0? To: Henning Schmiedehausen Cc: Benjamin Hindman , dev , Joerg Schad , Brenden Matthews Content-Type: multipart/alternative; boundary=f46d04427186cdc7d3051299a9ce X-Virus-Checked: Checked by ClamAV on apache.org --f46d04427186cdc7d3051299a9ce Content-Type: text/plain; charset=UTF-8 Hi Henning, It does sound suspicious. Would you mind capturing this in a JIRA ticket (targeted 0.22.1) so we can track progress and assign an owner? Thanks, Niklas On 31 March 2015 at 10:00, Henning Schmiedehausen < hschmiedehausen@groupon.com> wrote: > Ok, so here is my log file from the slave: > https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd > > - Slave shuts down executor: > > I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local > I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update > TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local from executor(1)@10.101.131.128:55377 > I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT > I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received > status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) > for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT > of framework Singularity_local > I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager > successfully handled status update TASK_KILLED (UUID: > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local > I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement > for status update TASK_KILLED (UUID: > 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local to executor(1)@10.101.131.128:55377 > I0330 16:28:00.443361 18422 slave.cpp:3193] Executor > 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of > framework Singularity_local exited with status 0 > > Now my framework receives the TASK_KILLED (I see that) but fails to > send the ack message back to mesos. After 30 seconds: > > W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending > status update TASK_RUNNING (UUID: > ea431985-98de-4447-8668-fda26c95f040) for task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local > I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT > of framework Singularity_local to the slave > I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local to master@10.101.131.128:5050 > > After another 20 seconds: > > W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending > status update TASK_RUNNING (UUID: > ea431985-98de-4447-8668-fda26c95f040) for task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local > I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding > update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for > task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT > of framework Singularity_local to the slave > I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update > TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of > framework Singularity_local to master@10.101.131.128:5050 > > so for my task " > MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT", > which is finished (the executor has exited), the slave is now > resending "TASK_RUNNING". And that can not be right. This should be > "TASK_KILLED". > > As a result, the state machine in my framework is now confused, > because a task just transitioned from TASK_KILLED to TASK_RUNNING. > Which should not happen, because TASK_KILLED is a terminal state. > > So I guess that a bug in my framework code has exposed a bug in the > mesos slave state resending code. > > -h > > (BTW, I am a bit concerned that these discussions do not happen on the > mesos-dev list. I added it in Cc. This is still an Apache project, > right? (Wears his Apache hat)). > > On Mon, Mar 30, 2015 at 9:13 PM, Henning Schmiedehausen > wrote: > > Hi Ben, > > > > quick question about a comment in the mesos source code: > > > https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L695-L703 > > > > // If the update is driver-generated or master-generated, it > > // does not require acknowledgement and so we unset the 'uuid' > > // field of TaskStatus. Otherwise, we overwrite the field to > > // ensure that a 0.22.0 scheduler driver supports explicit > > // acknowledgements, even if running against a 0.21.0 cluster. > > // > > // TODO(bmahler): Update the slave / executor driver to ensure > > // that 'uuid' is set accurately by the time it reaches the > > // scheduler driver. This will be required for pure bindings. > > if (from == UPID() || pid == UPID()) { > > > > No matter how much I twist and turn this piece of code, but I > > interpret that as "if the sender of the message or the pid from the > > message itself is the local address". Why would that skip > > master-generated updates? > > > > In fact, my problems seem to be related to a misinterpretation of this > > statement in my java driver (jesos, https://github.com/groupon/jesos). > > However I believe that this uncovered an actual bug in the slave > > behavior: > > > > - Executor exits, Slave sends TaskStatus "TASK_KILLED" message to master > > - Master sends the message to the Framework. > > - Framework code fails to acknowledge the message (which is a > > legitimate error case, because messages can get lost) > > - After a timeout, the Slave resends the TaskStatus message, however, > > it does so with state TASK_RUNNING, not TASK_KILLED. > > > > I could verify that behavior in 0.22.0 and the current trunk. I have a > > bit more debugging in a log file in the office, i will send that to > > you tomorrow. > > > > -h > > > > > > > > On Mon, Mar 30, 2015 at 7:45 PM, Benjamin Hindman > wrote: > >> Hey Henning! > >> > >> Someone mentioned today that you were discussing a potential bug in > 0.22.0 > >> on IRC? Can you elaborate? We are taking this very seriously and would > like > >> to make sure any issues are dealt with promptly. > >> > >> Ben. > > > > > > > > -- > > Henning Schmiedehausen - Groupon Engineering Principal > > Groupon Inc, Palo Alto - henning@groupon.com > > > > -- > Henning Schmiedehausen - Groupon Engineering Principal > Groupon Inc, Palo Alto - henning@groupon.com > --f46d04427186cdc7d3051299a9ce--