Subject: Re: Review Request: Updates and additions to the MPI framework
From: "Harvey Feng"
To: "Benjamin Hindman", "Charles Reiss"
Cc: "Harvey Feng", "mesos"
Reply-To: mesos-dev@incubator.apache.org
Date: Fri, 20 Apr 2012 08:19:05 -0000
Message-ID: <20120420081905.2530.26847@reviews.apache.org>
In-Reply-To: <20120418054137.2531.9446@reviews.apache.org>
X-ReviewRequest-URL: https://reviews.apache.org/r/4768/

> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/README.txt, line 11
> >
> > mpd was deprecated?
> > What's the current alternative?

I think the new versions use the Hydra process manager, so 'mpiexec' would be the only command needed to launch an MPI program.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 22
> >
> > Remove or comment this debugging.

done.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/startmpd.py, line 83
> >
> > Use os.kill instead (and above).

done.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/startmpd.py, line 56
> >
> > Can we use MPD's exit status to determine when to send TASK_FAILED or TASK_KILLED?

ok, fixed that.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/startmpd.py, line 15
> >
> > I think we can get rid of this entirely; it's clearly wrong in the case where multiple MPIs are running, and we should be tracking stray processes so we eventually kill them if MPD doesn't do something funny. (And if it does, we should figure out how to disable that.)

ok - shutdown() should remove any stray processes left over.


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 210
> >
> > Let's try a name that doesn't contain test or Python and will give a hint when multiple instances are running, like something using MPI_TASK.

changed to 'MPI: ' + MPI_TASK, and added a --name option


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 95
> >
> > Remove trailing whitespace.

done


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/nmpiexec.py, line 31
> >
> > Can we avoid using the shell here (and having MPI_TASK be interpreted by the shell twice)?

ok


> On 2012-04-18 05:41:37, Charles Reiss wrote:
> > frameworks/mpi/README.txt, line 37
> >
> > We should probably support taking the path to these binaries as an option passed automatically to the executor (e.g. through an environment variable option) to avoid PATH issues.

ok.
Passes the directory to the MPI binaries using the executor's CommandInfo.


- Harvey


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/4768/#review6999
-----------------------------------------------------------


On 2012-04-20 08:17:57, Harvey Feng wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/4768/
> -----------------------------------------------------------
>
> (Updated 2012-04-20 08:17:57)
>
>
> Review request for mesos, Benjamin Hindman and Charles Reiss.
>
>
> Summary
> -------
>
> Some updates to point out:
>
> -nmpiexec.py
>   -> 'mpdallexit' should terminate all slaves' mpds in the ring. I moved 'driver.stop()' to statusUpdate() so that it stops when all tasks have finished, which occurs when the executor's launched mpd processes have all exited.
> -startmpd.py
>   -> Didn't remove cleanup(), and added code in shutdown() that manually kills mpd processes. They might be useful during abnormal (cleanup) and normal (shutdown) framework/executor termination...I think. cleanup() still terminates all mpd's in the slave, but shutdown doesn't.
>   -> killtask() stops the mpd associated with the given tid.
>   -> Task states update nicely now. They correspond to the state of a task's associated mpd process.
> -Readme
>   -> Included additional info on how to set up and run MPICH2 1.2 and nmpiexec on OS X and Ubuntu/Linux
>
>
> This addresses bug MESOS-183.
>     https://issues.apache.org/jira/browse/MESOS-183
>
>
> Diffs
> -----
>
>   frameworks/mpi/README.txt cdb4553
>   frameworks/mpi/nmpiexec.py a5db9c0
>   frameworks/mpi/startmpd.py 8eeba5e
>
> Diff: https://reviews.apache.org/r/4768/diff
>
>
> Testing
> -------
>
>
> Thanks,
>
> Harvey
>
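
For reference, here is a rough sketch of three points raised in the review: launching without the shell, stopping a process with os.kill, and using the exit status to choose between TASK_FAILED and TASK_KILLED. It is not the code in the diff. The helper names (build_argv, classify_exit, run_and_classify, kill_and_classify) and the binary directory in the usage note are hypothetical, the state names are plain strings rather than the Mesos protobuf constants, and the mapping from exit status to state is only one plausible choice.

```python
import os
import shlex
import signal
import subprocess
import sys


def build_argv(mpi_bin_dir, program, task):
    """Build an argument list for an MPI launcher (hypothetical helper).

    The task string is split exactly once with shlex, so no shell
    re-interprets it later.
    """
    return [os.path.join(mpi_bin_dir, program)] + shlex.split(task)


def classify_exit(returncode):
    """Map a child's return code to a task-state name (assumed mapping)."""
    if returncode == 0:
        return "TASK_FINISHED"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return "TASK_KILLED"
    return "TASK_FAILED"


def run_and_classify(argv):
    """Run argv without a shell and classify how it exited."""
    proc = subprocess.Popen(argv)  # shell=False is the default
    return classify_exit(proc.wait())


def kill_and_classify(proc):
    """Stop a running child with os.kill and classify how it exited."""
    os.kill(proc.pid, signal.SIGTERM)
    return classify_exit(proc.wait())
```

For example, build_argv("/opt/mpich2/bin", "mpiexec", "-n 4 ./hello") yields a list that can be handed straight to subprocess.Popen; the directory here is made up and would in practice come from whatever the executor receives. A child stopped through kill_and_classify() is reported as TASK_KILLED, while one that exits non-zero by itself is reported as TASK_FAILED.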