oodt-commits mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (OODT-692) Use lsof to stop Workflow/Resource Manager task/job PIDs
Date Sun, 14 Sep 2014 22:42:34 GMT

     [ https://issues.apache.org/jira/browse/OODT-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Chris A. Mattmann updated OODT-692:
    Fix Version/s:     (was: 0.7)

> Use lsof to stop Workflow/Resource Manager task/job PIDs 
> ---------------------------------------------------------
>                 Key: OODT-692
>                 URL: https://issues.apache.org/jira/browse/OODT-692
>             Project: OODT
>          Issue Type: Bug
>          Components: pge wrapper framework, resource manager, workflow manager
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>              Labels: killjob, manager, oodt, pid, resource, unix, workflow
>             Fix For: 0.8
> We can exploit a combination of lsof, JobDir, and WorkflowInstanceId to kill
the process ID and fully stop a job kicked off by the resource manager and workflow manager.
I've been testing this process by hand on the ASO process and it works well in practice,
so we should automate it. For example:
> {noformat}
> [snowdeploy@trango-private bin]$ lsof -p 37558
> idl     37558 snowdeploy  cwd    DIR    253,2     4096 488284165 /data/jobs/CASI/ISSP/20140511f1_184151_1399903013836
> ..
> {noformat}
> This reveals to us that process ID 37558 (one of the IDL jobs running in ASO for the
ORTHO process) corresponds to _JobDir_:
> {noformat}
> /data/jobs/CASI/ISSP/20140511f1_184151_1399903013836
> {noformat}
> We can also find out from WorkflowInstanceMetadata that the Workflow Instance Id corresponding
to the _JobDir_ for line _184151_ is _726af17c-c131-4682-845e-4ef6b4a7eeee_.
> So, from a Workflow Instance Id, we need:
> # the JobDir resolved by CAS-PGE. If it's not a CAS-PGE job, the WorkflowTask needs
to specify a JobDir, or else this functionality will simply print out a message saying "Kill
without JobDir not supported."
> # a map of processes to interrogate with lsof, e.g., PCS_JobKillProcessName
> # the use of lsof to interrogate the PID table, find the process whose working directory
is the JobDir, and then kill it. If PCS_JobKillProcessName is not specified, then interrogate
all processes to determine the one to kill.
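
The three steps above could be automated roughly as follows. This is only a minimal Python sketch under stated assumptions, not the OODT implementation: the function names (parse_lsof_cwd, pids_for_job_dir, kill_job) are hypothetical, it assumes a process belongs to the job when its cwd is exactly the JobDir, and it assumes SIGTERM is an acceptable way to stop it. It relies on lsof's field output (-F) so that only PID and name lines need to be parsed.

```python
import os
import signal
import subprocess


def parse_lsof_cwd(field_output):
    """Map PID -> cwd path from `lsof -d cwd -Fpn` field output.

    Field output has one field per line: 'p<pid>' starts a process,
    'n<path>' is the file name. Other fields (e.g. 'fcwd') are skipped.
    """
    cwds = {}
    pid = None
    for line in field_output.splitlines():
        if line.startswith("p"):
            pid = int(line[1:])
        elif line.startswith("n") and pid is not None:
            cwds[pid] = line[1:]
    return cwds


def pids_for_job_dir(field_output, job_dir):
    """Return the sorted PIDs whose cwd is exactly job_dir (assumption)."""
    target = os.path.normpath(job_dir)
    return sorted(
        pid
        for pid, cwd in parse_lsof_cwd(field_output).items()
        if os.path.normpath(cwd) == target
    )


def kill_job(job_dir, process_name=None):
    """Hypothetical helper: find processes running in job_dir and stop them.

    process_name plays the role of PCS_JobKillProcessName; when it is
    not given, every process visible to lsof is interrogated.
    """
    if not job_dir:
        print("Kill without JobDir not supported.")
        return []
    # -d cwd: only the current-working-directory entry; -a: AND the filters.
    cmd = ["lsof", "-a", "-d", "cwd", "-Fpn"]
    if process_name:
        cmd += ["-c", process_name]  # restrict to this command name
    result = subprocess.run(cmd, capture_output=True, text=True)
    pids = pids_for_job_dir(result.stdout, job_dir)
    for pid in pids:
        os.kill(pid, signal.SIGTERM)  # assumption: SIGTERM is sufficient
    return pids
```

For the ASO example above, the call would look like kill_job("/data/jobs/CASI/ISSP/20140511f1_184151_1399903013836", process_name="idl"). Keeping the parsing separate from the lsof invocation makes the matching logic testable without a running job.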

This message was sent by Atlassian JIRA
