From: Raminder Singh
To: dev@airavata.apache.org
Subject: Re: Experiment Cancellation
Date: Wed, 13 Aug 2014 11:25:08 -0400
Message-Id: <59B49733-939A-4890-BC3E-A35B817243F8@gmail.com>
In-Reply-To: <53EB618F.2040504@iu.edu>

We can't depend on the queue status, because it differs from machine to machine and none of the machines reports a canceled job as canceled (see the examples below). Since Airavata is managing the job and received the cancel request from the user, Airavata should mark the job status as canceled, along with the task and experiment status, once the cancel attempt succeeds. If the job is canceled in the queued state, we don't have stdout/stderr; if it is canceled in the running state, stdout/stderr will not contain any indication that the job was canceled. As we discussed, when we are able to cancel the job successfully, we should mark the job status as canceled and stop monitoring the job. In the case of UltraScan, we don't want to run the output handlers. Other gateways may need to retrieve some outputs, and that can be handled with an API flag. As I understand it, the simple workflow steps are as follows. Please add to this if I missed anything.

1. The user calls job cancel with the intermediate outputs flag set to false.
2. The validator checks the current status.
    2.A. If the status is executing:
        1. It calls the job cancel function from the orchestrator.
        2. On success, we remove the job from the queue viewer or mark the status as canceled.
        3. If the job status is canceled and the flag is false, we don't call the output handler.
        4. If the intermediate outputs flag is true, search for stdout/stderr.

    2.B. If the status is anything else, the API returns an exception saying that the operation is not allowed.
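
The steps above can be sketched roughly as follows. This is only an illustration in Python, not Airavata code: the `Status` values, the `OperationNotAllowed` exception, the dict-shaped job, and the two callables standing in for the orchestrator's cancel function and the stdout/stderr lookup are all hypothetical names. Raising an error when the cancel attempt itself fails is also an assumption, since the steps do not cover that case.

```python
from enum import Enum


class Status(Enum):
    LAUNCHED = "launched"
    EXECUTING = "executing"
    COMPLETED = "completed"
    CANCELED = "canceled"


class OperationNotAllowed(Exception):
    """Raised when cancel is requested for a job that is not executing."""


def cancel_job(job, cancel_on_resource, fetch_outputs, intermediate_outputs=False):
    """Cancel `job` following steps 1 and 2.

    job: dict with "id" and "status" keys (hypothetical shape).
    cancel_on_resource: callable(job_id) -> bool, stands in for the
        orchestrator's cancel function.
    fetch_outputs: callable(job_id), stands in for the stdout/stderr search.
    """
    # Step 2.B: any status other than executing is rejected.
    if job["status"] is not Status.EXECUTING:
        raise OperationNotAllowed(
            f"cancel not allowed in state {job['status'].value}"
        )

    # Step 2.A.1: ask the orchestrator to cancel the job.
    if not cancel_on_resource(job["id"]):
        # Assumption: a failed attempt leaves the status untouched.
        raise RuntimeError(f"cancel failed for job {job['id']}")

    # Step 2.A.2: mark the job as canceled on success.
    job["status"] = Status.CANCELED

    # Steps 2.A.3 and 2.A.4: outputs are collected only when the flag is set.
    if intermediate_outputs:
        return fetch_outputs(job["id"])
    return None
```

With the flag left at false (the UltraScan case), the output lookup is never called; a gateway that wants partial output would pass `intermediate_outputs=True`.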
Thanks
Raminder

Trestles >>
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                   Req'd     Req'd        Elap
Job ID                  Username    Queue    Jobname          SessID   NDS    TSK Memory      Time S      Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797          --     2     64     --  00:30:00 Q        --
[us3@trestles-login1 ~]$ qdel 2242884
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                   Req'd     Req'd        Elap
Job ID                  Username    Queue    Jobname          SessID   NDS    TSK Memory      Time S      Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797           0     2     64     --  00:30:00 R  00:00:05

[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                   Req'd     Req'd        Elap
Job ID                  Username    Queue    Jobname          SessID   NDS    TSK Memory      Time S      Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797       10302     2     64     --  00:30:00 C        --


Stampede >>
us3@login4.stampede ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3897023      normal A8020068      us3 PD       0:00      2 (Priority)
us3@login4.stampede ~ $ scancel 3897023
us3@login4.stampede ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Lonestar >>
us3@lonestar ~ $ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2109621 0.00000 A619522656 us3          qw    08/13/2014 09:44:43                                   24
us3@lonestar ~ $ qdel 2109621
us3 has deleted job 2109621
us3@lonestar ~ $ qstat
us3@lonestar ~ $

Alamo >>
us3@alamo ~ $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
193052.alamo              1967556229       us3                    0 R default
us3@alamo ~ $ qdel 193052
us3@alamo ~ $ qstat
us3@alamo ~ $
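
Given the transcripts above, one scheduler-neutral check is simply whether the job still shows up as active in the listing. The sketch below is a rough Python illustration, not existing Airavata code; the function name `job_left_queue` is made up, and it assumes the job ID is the first column of each row and that a single-letter state of `C` (as Trestles shows after `qdel`) means the job is finished.

```python
def job_left_queue(listing: str, job_id: str) -> bool:
    """Return True if `job_id` is gone from the listing or shown as completed.

    `listing` is the raw text printed by qstat or squeue.
    """
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Job IDs may carry a host suffix, e.g. "2242884.trestles-fe1.l".
        if fields[0].split(".")[0] != job_id:
            continue
        # The job is still listed. Trestles keeps finished jobs around
        # for a while with state "C"; anything else counts as active.
        return "C" in fields[1:]
    # No row for this job: it has left the queue (Stampede, Lonestar, Alamo).
    return True
```

Fed the three Trestles listings in order, this returns False for the `Q` and `R` rows and True for the `C` row; fed the empty listings from Stampede, Lonestar, and Alamo after the cancel, it returns True.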


On Aug 13, 2014, at 9:01 AM, Marlon Pierce <marpierc@iu.edu> wrote:

> There is an advantage to having the task (or job) state capture the information that really comes from the machine (completed, cancelled, failed, etc.), and having the experiment state set to canceled by Airavata. That is, there should be parts of Airavata that capture machine-specific state information about the job for logging/auditing purposes.
>
> * Airavata issues a "cancel" command to a job in the "launched" or "executing" state.
>
> * Airavata confirms that the job has left the queue or is no longer executing. This could be machine-specific, but the main question is "has the job left the queue?" or "is the job no longer in executing state?" I don't think it is "if this is trestles, and since we issued a qdel command, is the job marked as completed; or if this is stampede, is the job now marked as failed?"
>
> * If the job cancel works, then Airavata marks this as canceled.
>
> * If cancel fails for some reason, don't change the Experiment state but throw an error.
>
>
> Marlon
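
To make the split between machine-reported job state and Airavata-set experiment state concrete, here is a rough Python sketch. It is not the Airavata data model: `ExperimentRecord`, `CancelFailed`, and `apply_cancel_result` are hypothetical names, and keeping the previous job state when the machine reports nothing is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


class CancelFailed(Exception):
    """Raised when the job is still queued or executing after a cancel."""


@dataclass
class ExperimentRecord:
    experiment_state: str  # set by Airavata
    job_state: str         # whatever the machine last reported


def apply_cancel_result(
    record: ExperimentRecord, left_queue: bool, machine_state: Optional[str]
) -> ExperimentRecord:
    """Update `record` after a cancel attempt.

    left_queue: True once the job has left the queue or stopped executing.
    machine_state: raw state from the scheduler (for example "C" on
        Trestles), or None when the job simply vanished from the listing.
    """
    if not left_queue:
        # Cancel failed: leave every state untouched and report the error.
        raise CancelFailed("job is still queued or executing")

    # Keep the machine's own wording for logging/auditing purposes.
    if machine_state is not None:
        record.job_state = machine_state

    # The experiment state is decided by Airavata, not by the machine.
    record.experiment_state = "CANCELED"
    return record
```

On Trestles this would leave `job_state` as `C` while the experiment reads `CANCELED`; on a machine where the job just disappears, the last known job state is kept.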

> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>> Hi All,
>>
>> I have a few concerns about experiment cancellation. When we want to cancel an experiment, we have to run a particular command on the computing resource. Different computing resources show the job status of cancelled jobs in different ways. For example, Trestles shows cancelled jobs as completed, some other machines show them as cancelled, and some might show them as failed.
>>
>> I think we should replicate this information in the JobDetails object as the job status and make sure the experiment and task statuses are set to cancelled. The other approach is that, when we cancel, we explicitly mark all the states in the experiment model (experiment, task, and job states) as cancelled and manually handle the state we get from the computing resource.
>>
>> My concern is: should we really hide the information shown by the computing resource from the job status we are storing in the registry, or leave it as it is and handle other statuses to represent the cancelled experiments? If we make everything cancelled, there will be an inconsistency in the JobStatus.
>>
>> WDYT?
>>
>> Lahiru


