oodt-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: Handling task failures in a workflow
Date Mon, 23 Nov 2015 16:56:39 GMT
Hey Val,

Yes, the lifecycle model is used for state progression, and for
the UI, but when you use a different workflow engine, e.g., the
PrioritizedQueueBasedWorkflowEngine, it starts to make more use of
the information. If you look up Workflow2 and where we were going
with it in OODT you will see the evolution of the workflow manager
and state/task management. See OODT-491 and OODT-215 for more
information. I have some items left to finish there in order for more
robust failure conditions to be handled automatically by the
framework and would love some interest and help.


Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA

-----Original Message-----
From: "Mallder, Valerie" <Valerie.Mallder@jhuapl.edu>
Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Date: Wednesday, November 11, 2015 at 4:22 PM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: Handling task failures in a workflow

>Hi All,
>
>I'd like to revisit a topic that came up not too long ago about
>communicating a task's completion status back to its parent workflow.  I
>just want to make sure I understand what other folks are doing to handle
>this before I go down a rabbit hole trying to invent a way to do it.
>
>I have been investigating the workflow manager code to see what
>capabilities are already built-in and what are not. I have just a simple
>sequential workflow, so, I have really only investigated the workflow
>manager from the ThreadPoolWorkflowEngine and
>IterativeWorkflowProcessorThread point of view.  I looked to find all of
>the places where something is setting or getting any type of 'status'
>value or lifecycle stage type of value.  And I have some questions
>related to what I found.
>
>I found that a workflow will stop executing tasks if any of its tasks
>require a specific metadata key and value to be present in the shared
>metadata, but that key and value are not present. If this occurs, the
>workflow instance status is set to METADATA_MISSING and is updated in the
>workflow instance repository and then the workflow stops iterating
>through the list of tasks.  This is the only error condition I could find
>that will cause a workflow to stop executing.  There does exist a
>workflow status called "ERROR", but I could not find any place where the
>workflow status is being set to "ERROR" or anyplace where the workflow
>status is being checked to see if it is equal to "ERROR".  So, it looks
>like workflow manager is very "success oriented", which is fine. But for
>me it raises the question: how are people accounting for errors that might
>occur during execution of a workflow that might require the workflow to
>stop or not execute one of its tasks? For example, let's say if task
>"abc" fails, and you want workflow manager to continue with the next
>task, but you don't want workflow manager to run task "xyz" which is
>further down the task list.   Is anyone out there doing anything to
>handle this type of situation?  And if so, please tell me what approach
>you took to implement your logic to handle it.  I am really hoping that
>someone else has explored this area before me :) (Right now, it is looking
>like I would need to write my own IterativeWorkflowProcessorThread and
>build the logic into it, but that is a rabbit hole I don't want to go
>down.)
>
>I should note that I am using the word "task" to mean both a PGE Task and
>a non-PGE task. I realize that the 'exe' section inside of a PGE Task
>gives you the freedom to write scripts that do check returned values from
>different commands that are in the script. But, I'm really trying to ask
>about the communication of statuses that exist at a layer above the PGE
>
>Thanks for your help!
>
>Valerie A. Mallder
>New Horizons Deputy Mission System Engineer
>The Johns Hopkins University/Applied Physics Laboratory
>11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
>240-228-7846 (Office) 410-504-2233 (Blackberry)
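
The scenario in the quoted message (task "abc" fails, the workflow continues with the next task, but task "xyz" further down the list is not run) can be sketched roughly as below. This is only an illustration under stated assumptions and is not OODT code: the Task and Step types, the runSequential method, the "<name>.status" metadata keys, and the SUCCESS/FAILURE/SKIPPED values are all hypothetical names invented for the sketch. The one idea borrowed from the message is that a task can declare metadata key/value pairs that must be present in the shared metadata; here an unmet requirement skips the task rather than stopping the workflow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SkipOnFailureSketch {

    /** Hypothetical task abstraction; NOT the OODT task interface. */
    interface Task {
        void run(Map<String, String> sharedMetadata) throws Exception;
    }

    /** A named task plus the metadata key/value pairs it requires before it may run. */
    static final class Step {
        final String name;
        final Task task;
        final Map<String, String> requiredMetadata;

        Step(String name, Task task, Map<String, String> requiredMetadata) {
            this.name = name;
            this.task = task;
            this.requiredMetadata = requiredMetadata;
        }
    }

    /**
     * Runs the steps in order. A failing step is recorded in the shared
     * metadata as "<name>.status" = "FAILURE" and the loop continues. A step
     * whose required metadata is not satisfied is marked "SKIPPED" and not run.
     * Returns the names of the steps that ran to completion.
     */
    static List<String> runSequential(List<Step> steps, Map<String, String> sharedMetadata) {
        List<String> completed = new ArrayList<>();
        for (Step step : steps) {
            if (!requirementsMet(step, sharedMetadata)) {
                sharedMetadata.put(step.name + ".status", "SKIPPED");
                continue;
            }
            try {
                step.task.run(sharedMetadata);
                sharedMetadata.put(step.name + ".status", "SUCCESS");
                completed.add(step.name);
            } catch (Exception e) {
                sharedMetadata.put(step.name + ".status", "FAILURE");
            }
        }
        return completed;
    }

    /** True when every required key is present in the shared metadata with the required value. */
    static boolean requirementsMet(Step step, Map<String, String> sharedMetadata) {
        for (Map.Entry<String, String> required : step.requiredMetadata.entrySet()) {
            if (!required.getValue().equals(sharedMetadata.get(required.getKey()))) {
                return false;
            }
        }
        return true;
    }

    /** Builds an abc / def / xyz workflow where abc fails, and returns the final shared metadata. */
    static Map<String, String> demo() {
        List<Step> steps = new ArrayList<>();
        steps.add(new Step("abc", md -> { throw new IllegalStateException("abc failed"); }, new HashMap<>()));
        steps.add(new Step("def", md -> md.put("def.output", "ok"), new HashMap<>()));
        Map<String, String> xyzNeeds = new HashMap<>();
        xyzNeeds.put("abc.status", "SUCCESS"); // xyz only runs if abc succeeded
        steps.add(new Step("xyz", md -> md.put("xyz.output", "ok"), xyzNeeds));

        Map<String, String> shared = new HashMap<>();
        runSequential(steps, shared);
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the shared metadata ends up with abc.status set to FAILURE, def.status set to SUCCESS, and xyz.status set to SKIPPED. Whether something equivalent can be done in the workflow manager without a custom processor thread is exactly the open question in the thread.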
