From: Sharma Podila
To: user@mesos.apache.org
Reply-To: user@mesos.apache.org
Date: Tue, 1 Jul 2014 10:03:58 -0700
Subject: Re: Task serialization per machine?
Mailing-List: contact user-help@mesos.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a113a3740f36d1a04fd24c20e

--001a113a3740f36d1a04fd24c20e
Content-Type: text/plain; charset=UTF-8

Hi Asim,

I am developing a Java executor, and I see a similar strategy (running the task asynchronously) in the Mesos-Hadoop executor:
https://github.com/mesos/hadoop/blob/master/src/main/java/org/apache/hadoop/mapred/MesosExecutor.java

Once the executor has successfully launched the task (asynchronously), it usually sends a TaskState.TASK_RUNNING status message to the driver right away. It can then return from the launchTask method, but the executor process shouldn't exit; it has to remain running for at least the duration of the task. Upon completion of the task, the executor must notify Mesos of its completion. Mesos will report a task-lost status if the executor exits prematurely.

My explanation comes from understanding Mesos as a user and framework developer. Someone from the Mesos dev team may have a better way to explain this. I suspect framework callbacks, at least at the executor, aren't done concurrently.
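To make the pattern above concrete, here is a minimal sketch, not taken from any real framework. It assumes nothing about the actual Mesos Java bindings: StatusSink, AsyncTaskLauncher, and AsyncExecutorDemo are hypothetical stand-ins (StatusSink plays the role of the driver's status-update call), and task states are plain strings so the example runs without the Mesos library. It only illustrates the shape: launchTask hands the work to another thread and returns, the worker reports TASK_RUNNING and later a terminal state, and the process stays alive until the tasks are done.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the driver's status-update call; not the Mesos API.
interface StatusSink {
    void sendStatusUpdate(String taskId, String state);
}

// Hypothetical executor-side helper whose launchTask returns without blocking.
final class AsyncTaskLauncher {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Hands the work to a worker thread and returns immediately, so the
    // next launchTask callback can be delivered while this task still runs.
    void launchTask(StatusSink driver, String taskId, Runnable work) {
        pool.execute(() -> {
            // Reported from the worker thread, just before the work starts.
            driver.sendStatusUpdate(taskId, "TASK_RUNNING");
            try {
                work.run();
                driver.sendStatusUpdate(taskId, "TASK_FINISHED");
            } catch (RuntimeException e) {
                driver.sendStatusUpdate(taskId, "TASK_FAILED");
            }
        });
    }

    // Stops accepting new tasks and waits for in-flight ones, modelling
    // "the executor process must outlive its tasks".
    boolean shutdownAndWait(long timeoutMillis) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

final class AsyncExecutorDemo {
    public static void main(String[] args) {
        List<String> updates = new CopyOnWriteArrayList<>();
        StatusSink driver = (taskId, state) -> updates.add(taskId + ":" + state);

        // Each task waits (up to 2 s) until the other has started too,
        // which only succeeds promptly if the two tasks overlap in time.
        CountDownLatch bothStarted = new CountDownLatch(2);
        Runnable work = () -> {
            bothStarted.countDown();
            try {
                bothStarted.await(2, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        AsyncTaskLauncher launcher = new AsyncTaskLauncher();
        launcher.launchTask(driver, "task-1", work);
        launcher.launchTask(driver, "task-2", work);
        launcher.shutdownAndWait(5000);
        updates.forEach(System.out::println);
    }
}
```

In a real executor the same idea would presumably live inside the launchTask callback of the framework's executor class, with the status updates going through the actual driver; the thread pool here is just one way to get the work off the callback thread.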
I haven't looked into the details of why/how/etc.

On Tue, Jul 1, 2014 at 7:48 AM, Asim wrote:

> Thanks for your response!
>
> Yes the executor (launchTask) only gets one task that it executes
> synchronously and finishes. Since launchTask is a callback, my intuition
> is the scheduler should launch these tasks in parallel (even within a
> single machine) after calculating the resources required. I can create a
> new thread in launchTask() callback and return immediately but that will
> cause a lost slave since the scheduler assumes it is finished but there is
> a zombie thread still around. Hence, I am not completely sure creating new
> threads will solve this issue.
>
> I am using the C++ framework. Is there an example on how this is
> accomplished in current frameworks? I looked at Spark and it does not seem
> to be doing anything special for its callbacks to ensure that multiple
> tasks on a single machine execute in parallel.
>
> Thanks,
> Asim
>
> On Mon, Jun 30, 2014 at 4:48 PM, Sharma Podila wrote:
>
>> A likely scenario is that your executor is running the task synchronously
>> inside the callback to launchTask(). If you make it instead run the task
>> asynchronously (e.g., in a separate thread), that should resolve it.
>>
>> On Mon, Jun 30, 2014 at 12:48 PM, Asim wrote:
>>
>>> Hi,
>>>
>>> I want to launch multiple tasks on multiple machines (t >> m) that can
>>> run simultaneously. Currently, I find that every machine processes the
>>> tasks in a serial fashion one after another.
>>>
>>> I have written a framework with a scheduler and a executor. The
>>> scheduler launches a task list on a bunch of machines (that show up as
>>> offers). When I send a task list to run
>>> with driver->launchTasks(offers[i].id(), tasks[i]) I find that every
>>> machine picks up one task at a time (and then goes to the next). This
>>> happens even though the offer can accommodate more than one task from this
>>> task list easily.
>>>
>>> Is there something that I am missing?
>>>
>>> Thanks,
>>> Asim

--001a113a3740f36d1a04fd24c20e
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a113a3740f36d1a04fd24c20e--