From: Adam Bordelon
To: dev
Cc: Chengwei Yang
Date: Tue, 3 Feb 2015 07:28:37 -0800
Subject: Re: When does scheduler driver send the LaunchTasksMessage
In-Reply-To: <20150203134624.GA7322@chengwei-debian.qiyi.com>
References: <20150203134624.GA7322@chengwei-debian.qiyi.com>
Reply-To: dev@mesos.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11347340892291050e30b9c6

--001a11347340892291050e30b9c6
Content-Type: text/plain; charset=UTF-8

Just make sure you only send one LaunchTasksMessage per slave, although that
message could contain multiple tasks launched on a collection of offers from
the same slave. You mention that launching 1000s in the same message causes
Mesos to crash. Do you have a crash stack available for this?

You shouldn't have to respond to all offers received before tasks get
launched. Some frameworks "hoard" offers in case they want to launch
something on them later, but launch other tasks in the meantime. Perhaps the
delay has something to do with Chronos' cron-like scheduling feature?

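To illustrate the one-message-per-slave point, here is a minimal sketch of grouping offers by slave before launching. It does not use the real Mesos bindings: `Offer`, `LaunchRequest`, and `plan_launches` are hypothetical names made up for this example, and the assumption is that each resulting request would be turned into a single launch call on the driver.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Offer:
    """Hypothetical stand-in for an offer: an ID plus the slave it came from."""
    offer_id: str
    slave_id: str


@dataclass
class LaunchRequest:
    """Hypothetical stand-in for the contents of one launch message."""
    slave_id: str
    offer_ids: list = field(default_factory=list)
    task_names: list = field(default_factory=list)


def plan_launches(offers, tasks_by_slave):
    """Build at most one LaunchRequest per slave.

    offers: iterable of Offer.
    tasks_by_slave: dict mapping slave_id -> list of task names to run there.
    Offers from slaves with no pending tasks are skipped here; the caller
    decides whether to decline them or hold on to them.
    """
    requests = {}
    for offer in offers:
        tasks = tasks_by_slave.get(offer.slave_id)
        if not tasks:
            continue
        request = requests.get(offer.slave_id)
        if request is None:
            # First offer seen for this slave: attach all of its tasks once.
            request = LaunchRequest(slave_id=offer.slave_id, task_names=list(tasks))
            requests[offer.slave_id] = request
        # Every further offer from the same slave joins the same request.
        request.offer_ids.append(offer.offer_id)
    return list(requests.values())
```

With four offers where two come from the same slave, `plan_launches` yields one request for that slave carrying both offer IDs, rather than two separate requests. Offers it skips are simply left untouched, which matches the "hoarding" behaviour mentioned above.
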
On Tue, Feb 3, 2015 at 5:46 AM, Chengwei Yang wrote:

> Hi List,
>
> We are running Chronos on Mesos 0.19.0 and found an interesting problem:
> if we try to launch about 1k tasks in a single resourceOffers(), it may
> crash, and no tasks are started by Mesos at all.
>
> So we ran the following test.
>
> We changed the Chronos resourceOffers() callback to:
>
> 1. print a log line
> 2. decline the first offer in the bunch of offers
> 3. sleep 30 seconds
> 4. decline all the offers received
>
> We also added a log statement in src/master/master.cpp that prints whenever
> a LaunchTasksMessage is received; see the log below.
>
> -----------8<-----------------------
> I0203 18:32:33.169342 7680 master.cpp:2939] Sending 3 offers to framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:32:39.523227 7670 http.cpp:452] HTTP request for '/master/state.json'
> I0203 18:32:49.601284 7674 http.cpp:452] HTTP request for '/master/state.json'
> I0203 18:32:59.677875 7677 http.cpp:452] HTTP request for '/master/state.json'
> I0203 18:33:03.390188 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.390949 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-0 ] on slave 20150203-183014-2487817994-5050-7668-2 at slave(1)@10.23.73.140:5051 (xulijian-mesos-online016-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.391469 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.391791 7670 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-2 for 5secs
> W0203 18:33:03.392019 7676 master.cpp:1871] Failed to validate offer 20150203-183014-2487817994-5050-7668-0: Offer 20150203-183014-2487817994-5050-7668-0 is no longer valid
> I0203 18:33:03.393173 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-1 ] of framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.393601 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-1 ] on slave 20150203-183014-2487817994-5050-7668-1 at slave(1)@10.23.73.141:5051 (xulijian-mesos-online017-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.394057 7676 master.cpp:1754] Received launchTasks message for offer [ 20150203-183014-2487817994-5050-7668-2 ] of framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.394379 7679 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-1 for 5secs
> I0203 18:33:03.394664 7676 master.cpp:1895] Processing reply for offers: [ 20150203-183014-2487817994-5050-7668-2 ] on slave 20150203-183014-2487817994-5050-7668-0 at slave(1)@10.23.73.148:5051 (xulijian-mesos-online015-cqdx.qiyi.virtual) for framework 20150203-174243-2487817994-5050-10996-0000
> I0203 18:33:03.395504 7676 hierarchical_allocator_process.hpp:589] Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave 20150203-183014-2487817994-5050-7668-0 for 5secs
> ---------------8<-------------------
>
> As we can see, mesos-master sent the offers to Chronos at 18:32:33, but
> received all 4 decline messages (LaunchTasksMessage) at 18:33:03. We are
> very curious why the first decline isn't sent before the 30-second sleep.
>
> From the log, we see that offer 0 is no longer valid because we had
> already sent a decline for it.
>
> Does that mean we (the framework scheduler) have to reply to all offers
> received before we can launch any task?
>
> --
> Thanks,
> Chengwei
>

--001a11347340892291050e30b9c6--