From: Vinod Kone
Sender: vinod@twitter.com
Date: Thu, 23 May 2013 10:02:04 -0700
Subject: Fwd: Question about TASK_LOST statuses
To: "mesos-dev@incubator.apache.org"
Reply-To: mesos-dev@incubator.apache.org
Mailing-List: contact mesos-dev-help@incubator.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=bcaec53962b07920c304dd65a514

--bcaec53962b07920c304dd65a514
Content-Type: text/plain; charset=ISO-8859-1

---------- Forwarded message ----------
From: Vinod Kone
Date: Sun, May 19, 2013 at 6:56 PM
Subject: Re:
Question about TASK_LOST statuses
To: "mesos-dev@incubator.apache.org"

> On the master's logs, I see this:
> - 5600+ instances of "Error validating task XXX: Task uses invalid slave:
> SOME_UUID"
> What do you think the problem is? I am copying the slave_id from the offer
> into the TaskInfo protobuf.
>
This will happen if the slave id in the task doesn't match the slave id in
the slave. Are you sure you are copying the right slave ids to the right
tasks? Looks like there is a mismatch. Maybe some logs/printfs in your
scheduler, when you launch tasks, can point out the issue.

> I'm using the process-based isolation at the moment (I haven't had the time
> to set up the cgroups isolation yet).
>
> I can find and share whatever else is needed so that we can figure out why
> these messages are occurring.
>
> Thanks,
> David
>
>
> On Fri, May 17, 2013 at 5:16 PM, Vinod Kone wrote:
>
> > Hi David,
> >
> > You are right in that all these status updates are what we call "terminal"
> > status updates and mesos takes specific actions when it gets/generates one
> > of these.
> >
> > TASK_LOST is special in the sense that it is not generated by the executor,
> > but by the slave/master. You could think of it as an exception in mesos.
> > Clearly, these should be rare in a stable mesos system.
> >
> > What do your logs say about the TASK_LOSTs? Is it always the same issue?
> > Are you running w/ cgroups?
> >
> >
> >
> > On Fri, May 17, 2013 at 2:04 PM, David Greenberg wrote:
> >
> > > Hello! Today I began working on a more advanced version of mesos-submit
> > > that will handle hot-spares.
> > >
> > > I was assuming that TASK_{FAILED,FINISHED,LOST,KILLED} were the status
> > > updates that meant that I needed to start a new spare process, as the
> > > monitored task was killed. However, I noticed that I often received
> > > TASK_LOSTs, and every 5 seconds, my scheduler would think its tasks had all
> > > died, so it'd restart too many. Nevertheless, the tasks would reappear
> > > later on, and I could see them in the web interface of Mesos, continuing to
> > > run.
> > >
> > > What is going on?
> > >
> > > Thanks!
> > > David
> > >

--bcaec53962b07920c304dd65a514--
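
The check suggested in the reply (log the slave ids at launch time and look for a mismatch) could be sketched roughly as follows. This is only an illustration under stated assumptions: `Offer`, `TaskInfo`, `build_task`, `mismatched_tasks` and `SpareTracker` are hypothetical stand-ins written for this sketch, not the Mesos protobuf or scheduler API, and the task states are plain strings. It assumes each task should carry the slave id of the offer it is launched against, and that a scheduler managing hot spares wants to count a given task's terminal update at most once.

```python
from dataclasses import dataclass
from typing import List, Set

# Terminal states named in the thread. Plain strings stand in for the real
# Mesos task-state values (an assumption made for this sketch).
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


@dataclass(frozen=True)
class Offer:
    """Hypothetical stand-in for a resource offer."""
    offer_id: str
    slave_id: str


@dataclass(frozen=True)
class TaskInfo:
    """Hypothetical stand-in for the TaskInfo protobuf."""
    task_id: str
    slave_id: str


def build_task(task_id: str, offer: Offer) -> TaskInfo:
    # Copy the slave id from the same offer the task will be launched on.
    return TaskInfo(task_id=task_id, slave_id=offer.slave_id)


def mismatched_tasks(offer: Offer, tasks: List[TaskInfo]) -> List[str]:
    # Return the ids of tasks whose slave id differs from the offer's.
    # Logging this list before launching is one way to spot a mismatch.
    return [t.task_id for t in tasks if t.slave_id != offer.slave_id]


class SpareTracker:
    """Counts a terminal update at most once per launched task."""

    def __init__(self) -> None:
        self.running: Set[str] = set()
        self.spares_needed = 0

    def launched(self, task: TaskInfo) -> None:
        self.running.add(task.task_id)

    def status_update(self, task_id: str, state: str) -> None:
        # Ignore non-terminal states and repeated terminal updates for a
        # task that has already been accounted for.
        if state in TERMINAL_STATES and task_id in self.running:
            self.running.remove(task_id)
            self.spares_needed += 1
```

Calling `mismatched_tasks(offer, tasks)` just before launching and logging any non-empty result would show whether tasks are being paired with the wrong offer. `SpareTracker` only illustrates avoiding duplicate restarts; it does not address why the master rejected the tasks in the first place.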