airavata-dev mailing list archives

From Raminder Singh <raminderjsi...@gmail.com>
Subject Re: Experiment Cancellation
Date Wed, 13 Aug 2014 15:25:08 GMT
We can't depend on the queue status: it differs from machine to machine, and none of the machines
report a queue status of "canceled" for a canceled job (see examples below). Since Airavata manages
the job and received the cancel request from the user, Airavata should mark the job status as
canceled, along with the task and experiment statuses, on a successful cancel attempt. If the job is
canceled while still queued we have no stdout/stderr at all, and if it is canceled while running the
stdout/stderr will not record that the job was canceled. As we discussed, once we successfully
cancel the job we should mark the job status canceled and stop monitoring the job. In the case of
UltraScan we don't want to run the output handlers; other gateways may still want some intermediate
outputs, and that can be handled with an API flag. As I understand it, the simple workflow steps are
below (a rough sketch follows the list). Please add to this if I missed anything.

1. User calls job cancel with the intermediate-outputs flag set to false.
2. Validator checks the current status:
	2.A. If the status is executing:
		1. Call the job cancel function in the orchestrator.
		2. On success, remove the job from the queue viewer or mark its status canceled.
		3. If the job status is canceled and the flag is false, don't call the output handler.
		4. If the intermediate flag is true, fetch the stdout/stderr.

	2.B. For any other status, the API returns an exception saying the operation is not allowed.
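
To make the steps concrete, here is a rough sketch of that flow. It is illustrative only:
the collaborators (registry, orchestrator, monitor, outputHandler) and the state enums are
placeholders, not the actual Airavata APIs.

// Illustrative sketch only -- collaborator names and enums are placeholders,
// not the real Airavata interfaces.
public void cancelExperiment(String experimentId, boolean intermediateOutputs)
        throws OperationNotAllowedException {
    // Step 2: the validator checks the current status
    ExperimentState state = registry.getExperimentState(experimentId);

    // Step 2.B: cancel is only allowed while the experiment is executing
    if (state != ExperimentState.EXECUTING) {
        throw new OperationNotAllowedException("Cancel not allowed in state " + state);
    }

    // Step 2.A.1: call the job cancel function in the orchestrator (qdel/scancel/...)
    boolean cancelled = orchestrator.cancelJob(experimentId);

    if (cancelled) {
        // Step 2.A.2: stop monitoring and mark job/task/experiment canceled,
        // independent of whatever the queue reports afterwards
        monitor.stopMonitoring(experimentId);
        registry.updateJobState(experimentId, JobState.CANCELED);
        registry.updateTaskState(experimentId, TaskState.CANCELED);
        registry.updateExperimentState(experimentId, ExperimentState.CANCELED);

        // Steps 2.A.3 / 2.A.4: output handlers run only when intermediate
        // outputs were requested (e.g. UltraScan would set this to false)
        if (intermediateOutputs) {
            outputHandler.fetchStdoutAndStderr(experimentId);
        }
    }
}

The important part is that the CANCELED states are written by Airavata itself on a
successful cancel, rather than being derived from whatever the queue reports afterwards.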
              														 																																									
Thanks
Raminder

Trestles >> 
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd   Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory  Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797         --      2     64    --   00:30:00 Q       --
[us3@trestles-login1 ~]$ qdel 2242884
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd   Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory  Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797           0     2     64    --   00:30:00 R  00:00:05

[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd   Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory  Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797       10302     2     64    --   00:30:00 C       --


Stampede >>
us3@login4.stampede ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3897023      normal A8020068      us3 PD       0:00      2 (Priority)
us3@login4.stampede ~ $ scancel 3897023
us3@login4.stampede ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Lonestar >>
us3@lonestar ~ $ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2109621 0.00000 A619522656 us3          qw    08/13/2014 09:44:43                                   24
us3@lonestar ~ $ qdel 2109621
us3 has deleted job 2109621
us3@lonestar ~ $ qstat
us3@lonestar ~ $

Alamo >>
us3@alamo ~ $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
193052.alamo              1967556229       us3                    0 R default
us3@alamo ~ $ qdel 193052
us3@alamo ~ $ qstat
us3@alamo ~ $
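
For reference, the raw state codes in the listings above would map roughly as follows.
JobState here is a placeholder enum, not necessarily the Airavata one; the point is that
no resource ever reports a distinct "cancelled" state, so the queue cannot be the source
of truth after a cancel request.

// Illustrative mapping of the scheduler state codes seen in the listings above.
enum JobState { QUEUED, ACTIVE, COMPLETE, UNKNOWN }

static JobState fromQueueState(String code) {
    switch (code) {
        case "Q":   // PBS/Torque queued (Trestles, Alamo)
        case "PD":  // SLURM pending (Stampede)
        case "qw":  // SGE queued/waiting (Lonestar)
            return JobState.QUEUED;
        case "R":   // running (PBS and SLURM)
            return JobState.ACTIVE;
        case "C":   // PBS/Torque marks a qdel'ed job as Completed, not Cancelled
            return JobState.COMPLETE;
        default:    // SLURM and SGE simply drop the job from the listing
            return JobState.UNKNOWN;
    }
}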


On Aug 13, 2014, at 9:01 AM, Marlon Pierce <marpierc@iu.edu> wrote:

> There is an advantage for task (or job) state to capture the information that really
> comes from the machine (completed, cancelled, failed, etc.), and for experiment state to
> be set to canceled by Airavata. That is, there should be parts of Airavata that capture
> machine-specific state information about the job for logging/auditing purposes.
> 
> * Airavata issues a "cancel" command to a job in the "launched" or "executing" state.
> 
> * Airavata confirms that the job has left the queue or is no longer executing. This
> could be machine-specific, but the main question is "has the job left the queue?" or "is
> the job no longer in the executing state?" I don't think it is "if this is Trestles, and
> since we issued a qdel command, is the job marked as completed; or, if this is Stampede,
> is the job now marked as failed?"
> 
> * If the job cancel works, then Airavata marks the job as canceled.
> 
> * If cancel fails for some reason, don't change the Experiment state but throw an error.
> 
> 
> Marlon
> 
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>> Hi All,
>> 
>> I have a few concerns about experiment cancellation. When we want to cancel
>> an experiment we have to run a particular command on the computing
>> resource. Different computing resources show the job status of cancelled
>> jobs in different ways. Ex: Trestles shows cancelled jobs as completed,
>> some other machines show them as cancelled, and some might show them as
>> failed.
>> 
>> I think we should replicate this information in the JobDetails object as
>> the Job status and mark the Experiment and Task statuses as cancelled. The
>> other approach is, when we cancel, to explicitly mark all the states in the
>> experiment model (experiment, task, and job states) as cancelled and
>> manually handle the state we get from the computing resource.
>> 
>> My concern is: should we really hide the information shown by the computing
>> resource from the Job status we are storing in the registry? Or leave it as
>> it is and handle other statuses to represent cancelled experiments? If we
>> mark everything as cancelled there will be an inconsistency in the
>> JobStatus.
>> 
>> WDYT ?
>> 
>> Lahiru
>> 
> 

