oodt-dev mailing list archives

From "Mattmann, Chris A (3980)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: how to pass arguments to workflow task that is external script
Date Sat, 11 Oct 2014 15:51:24 GMT
Hey Val,

No worries, I can answer all the below for you:

> Yes, I saw the learn by example documentation. And maybe
> if I hadn't already built a configuration that runs the
> crawler successfully the example might make more sense to me.
> It doesn't look like anything I have already done can be used
> in that setup. 

Yeah, the thing is that CAS-PGE is an integrated Workflow Task that
brings together the typical pipeline processing activities. In building
it, we realized that the typical science pipeline workflow is:

(1) .. generate input files/metadata, switches, flags, etc. ..
(2) .. use information from #1 to execute science algorithm ..
(3) .. execute algorithm ..
(4) .. figure out if the algorithm produced outputs; if so, extract metadata
for them and ingest .. if not, help it generate some output ..
(5) .. catalog/ingest those outputs and tag them with info from the workflow
system ..

We realized that having the above as separate steps/workflow tasks would
be really difficult, especially in a distributed environment, so we isolated
them, in a shared-nothing scenario, into a single *job* directory in which
all of the above happens (out there somewhere on the nodes in your compute
system). So, it's a shared-nothing way to execute jobs with full pedigree
and processing information, *just as they would execute if the scientist
herself were running them outside of the system* and *unbeknownst* to the
algorithm itself, b/c it's running in exactly the same fashion. In essence
this implements the vision in my Nature article, and a big one we've had at
JPL in recent years, of "unobtrusive and rapid science algorithm
integration". There are some descriptions of it in my recent J. of Big Data
article (Open Access, not behind a paywall):

http://www.journalofbigdata.com/content/1/1/6



> It is not clear whether I have to move, duplicate or rewrite my
>extractor, 
> move or duplicate or rewrite my metadata definitions, etc. Is the
>metadata 
> extracted from the crawler's extractor shared with other PGE tasks?

You shouldn't have to do any of those. Your extractors can be used/integrated
into the final portions of the CAS-PGE flow through the use of the
<output> tags that you are working on in the other thread. One of my hopes
in pointing you at DRAT was that you could check out its CAS-PGE files,
see how the <output> tags are used there, and emulate them.

> The 
> example uses a different directory structure than my crawler understands
> and it's not clear how to map the example directories to crawler
> directories. 

The crawler should be able to crawl through directories that are
specified in the <output> section of the CAS-PGE file and that
match the specified <files regExp="..."> tags.
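
For instance, a minimal <output> block might look like this (a rough
sketch: the path, regExp, and args are placeholders I made up, so adapt
them to your deployment and double check the element/attribute names
against the linked examples):

<output>
   <dir path="[JobDir]/output" createBeforeExe="true">
      <!-- runs the named met file writer on everything matching regExp;
           the forked crawler then picks up the files + met and ingests -->
      <files regExp=".*\.dat$"
             metFileWriterClass="org.apache.oodt.cas.pge.writers.metlist.MetadataListPgeMetFileWriter"
             args="..."/>
   </dir>
</output>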

> In the example's tasks.xml file, it's not clear whether the
> configuration that is shown is one that is required to be defined for
>each 
> task that you define in the file. There are just a lot of questions that
> come up when I try to adapt the example to my system specifically.  But,
>I 
> will look at DRAT and try to work through it.

No worries, yes this is quite involved, but it will pay off in the
end. Let me see if I can answer the above:

1. Workflow Tasks.xml needs to declare a CAS-PGE task, like so:
http://svn.apache.org/repos/asf/oodt/trunk/pge/src/main/resources/examples/WorkflowTask/
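
A stripped-down sketch of what that declaration looks like (the task
id/name and paths here are made up, and you should copy the exact met key
names from the example above):

<task id="urn:oodt:MyPgeTask" name="MyPgeTask"
      class="org.apache.oodt.cas.pge.StdPGETaskInstance">
   <conditions/>
   <configuration>
      <!-- old-style CAS-PGE met keys; confirm the full set against the
           linked example -->
      <property name="PGETask_Name" value="MyPgeTask"/>
      <property name="PGETask_ConfigFilePath"
                value="[OODT_HOME]/pge/policy/pge-config.xml"/>
   </configuration>
</task>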


2. CAS-PGE task should use the met keys (in the old style, as
shown in that example, since that's what RADIX uses)

3. CAS-PGE task should point at a CAS-PGE XML config file, examples
of which are here:

http://svn.apache.org/repos/asf/oodt/trunk/pge/src/main/resources/examples/PgeConfigFiles/pge-config.xml


Specific examples, e.g., in DRAT:
https://github.com/chrismattmann/drat/tree/master/pge/src/main/resources/config


4. The config specifies 3 key areas (a rough skeleton follows below):
  - input: how to generate and write input/config files for the underlying
algorithm, using a Writer interface
  - execution: how to execute and run the algorithm. Each <cmd> line in that
block becomes a line in the script that CAS-PGE subsequently generates, and
the <exe dir="[JobDir]" shell="/bin/bash"> area defines the type of script
generated (e.g., /bin/bash)
  - output: the folders and files to scan for, and the specified met
extractors to run on them to extract metadata *before* crawling the output
and ingesting it into the file manager using CAS-Crawler
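
Roughly, a pge-config.xml skeleton covering those 3 areas (again just a
sketch: the writer class and paths are hypothetical placeholders, so
verify the element names against the pge-config.xml examples linked
above):

<pgeConfig>
   <!-- (input) write input/config files for the algorithm via a Writer -->
   <dynInputFiles>
      <file path="[JobDir]/algorithm-input.txt"
            writerClass="com.mycompany.MyInputFileWriter"
            args="..."/>
   </dynInputFiles>

   <!-- (execution) each <cmd> becomes a line in the generated script;
        shell= determines the type of script (/bin/bash here) -->
   <exe dir="[JobDir]" shell="/bin/bash">
      <cmd>run_my_algorithm [JobDir]/algorithm-input.txt</cmd>
   </exe>

   <!-- (output) dirs/files to scan and met extractors to run, as in the
        <output> sketch earlier in this mail, before CAS-Crawler ingests
        into the file manager -->
   <output>
      ...
   </output>
</pgeConfig>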

OK hopefully the above makes sense. Let me know if I can help more.

Keep on trucking, you're close!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: "Mallder, Valerie" <Valerie.Mallder@jhuapl.edu>
Reply-To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Date: Tuesday, October 7, 2014 at 6:08 PM
To: "dev@oodt.apache.org" <dev@oodt.apache.org>
Subject: RE: how to pass arguments to workflow task that is external script

>Yes, I saw the learn by example documentation. And maybe if I hadn't
>already built a configuration that runs the crawler successfully the
>example might make more sense to me. It doesn't look like anything I have
>already done can be used in that setup. It is not clear whether I have to
>move, duplicate or rewrite my extractor, move or duplicate or rewrite my
>metadata definitions, etc. Is the metadata extracted from the crawler's
>extractor shared with other PGE tasks? The example uses a different
>directory structure than my crawler understands and it's not clear how to
>map the example directories to crawler directories. In the example's
>tasks.xml file, it's not clear whether the configuration that is shown is
>one that is required to be defined for each task that you define in the
>file. There are just a lot of questions that come up when I try to adapt
>the example to my system specifically.  But, I will look at DRAT and try
>to work through it.
>
>Thanks,
>Val
>
>
>
>Valerie A. Mallder
>New Horizons Deputy Mission System Engineer
>Johns Hopkins University/Applied Physics Laboratory
>
>
>> -----Original Message-----
>> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> Sent: Tuesday, October 07, 2014 11:02 AM
>> To: dev@oodt.apache.org
>> Subject: Re: how to pass arguments to workflow task that is external
>>script
>>
>> Thanks Val, I agree, yes, CAS-PGE is complex.
>>
>> Did you see the learn by example wiki page:
>>
>> 
>>https://cwiki.apache.org/confluence/display/OODT/CAS-PGE+Learn+by+Example
>>
>>
>> I think it's pretty basic and illustrates what CAS-PGE does.
>>
>> Basically the gist of it is:
>>
>> 1. you only need to create a PGEConfig.xml file that specifies:
>>   - how to generate input for your integrated algorithm
>>   - how to execute your algorithm (e.g., how to generate a script that
>>executes it)
>>   - how to generate metadata from the output, and then to crawl the
>>files
>> + met and get the outputs into the file manager
>>
>> 2. you go into workflow tasks.xml, define a new CAS-PGE type task, point
>> at this config file, and provide CAS-PGE task properties; an example is
>> here:
>> http://svn.apache.org/repos/asf/oodt/trunk/pge/src/main/resources/examples/WorkflowTask/
>>
>>
>> If you want to see a basic example of CAS-PGE in action, check out DRAT:
>>
>> https://github.com/chrismattmann/drat/
>>
>> It's a RADIX-based deployment with 2 CAS-PGEs (one for the MIME
>> partition and another for RAT).
>>
>> Check that out, see how DRAT works (and integrates CAS-PGE) and then
>>let me
>> know if you are still confused and I will be glad to help more.
>>
>> Cheers,
>> Chris
>>
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>>
>>
>>
>>
>> -----Original Message-----
>> From: "Mallder, Valerie" <Valerie.Mallder@jhuapl.edu>
>> Reply-To: <dev@oodt.apache.org>
>> Date: Tuesday, October 7, 2014 at 4:56 PM
>> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> Subject: RE: how to pass arguments to workflow task that is external
>>script
>>
>> >Thanks Chris,
>> >
>> >The CAS-PGE is pretty complex, I've read the documentation and it is
>> >still way over my head.  Is there any documentation or examples for how
>> >to integrate the crawler into it?  For instance, can I still use the
>> >crawler_launcher script? Will the ExternMetExtractor and the
>> >postIngestSuccess ExternAction script that I created to work with
>> >the crawler still work "as is" in CAS-PGE? Or, should I invoke
>> >them differently?  What about the Metadata that I extracted with the
>>crawler?
>> >Do I have to redefine the metadata elements in another configuration
>> >file or policy file?  If there is any documentation on doing this
>> >please point me to the right place because I didn't see anything that
>> >addressed these kinds of questions.
>> >
>> >Thanks,
>> >Val
>> >
>> >
>> >Valerie A. Mallder
>> >New Horizons Deputy Mission System Engineer Johns Hopkins
>> >University/Applied Physics Laboratory
>> >
>> >> -----Original Message-----
>> >> From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
>> >> Sent: Tuesday, October 07, 2014 8:16 AM
>> >> To: dev@oodt.apache.org
>> >> Subject: Re: how to pass arguments to workflow task that is external
>> >>script
>> >>
>> >> Hi Val,
>> >>
>> >> Thanks for the detailed report. My suggestion would be to use CAS-PGE
>> >>directly  instead of ExternScriptTaskInstance. That application is not
>> >>well maintained, doesn't produce a log, etc, etc, all of the things
>> >>you've noted.
>> >>
>> >> CAS-PGE on the other hand, will (a) prepare input for your task; (b)
>> >>describe how  to run your task (even as a script and will generate a
>> >>script); and (c) will run met  extractors and fork a crawler in your
>> >>job directory in the end.
>> >>
>> >> I think it's what you're looking for and it's way better
>> >>documented on the wiki.
>> >>
>> >> Please check it out and let me know what you think.
>> >>
>> >> Cheers,
>> >> Chris
>> >>
>> >> ------------------------
>> >> Chris Mattmann
>> >> chris.mattmann@gmail.com
>> >>
>> >>
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: "Mallder, Valerie" <Valerie.Mallder@jhuapl.edu>
>> >> Reply-To: <dev@oodt.apache.org>
>> >> Date: Monday, October 6, 2014 at 11:53 PM
>> >> To: "dev@oodt.apache.org" <dev@oodt.apache.org>
>> >> Subject: how to pass arguments to workflow task that is external
>> >> script
>> >>
>> >> >Hello,
>> >> >
>> >> >I'm stuck again :(  This time I'm stuck trying to start my crawler as
>> >> >a task using the workflow manager.  I am not using a PGE task right
>>now.
>> >> >I'm just trying to do something simple with the workflow manager,
>> >> >filemgr, and crawler.  I have read all of the documentation that is
>> >> >available on the workflow manager and have tried to piece together a
>> >> >setup based on the examples, but, things seem to be working
>> >> >differently now and the documentation hasn't caught up, which is
>> >> >totally understandable  and not a criticism. Just want you to know
>> >> >that I try to do my due diligence before bothering anyone for help.
>> >> >
>> >> >I am not running the resource manager, and I have commented out
>> >> >setting the resource manager url in the workflow.properties file so
>> >> >that workflow manager will execute the job locally.
>> >> >
>> >> >I am sending workflow manager an event (via the command line using
>> >> >wmgr-client) called "startJediPipeline". Workflow manager receives
>> >> >the event, and retrieves my workflow from the repository and tries
>> >> >to execute the first (and only) task, and then it crashes.  My task
>> >> >is an external script (the crawler_launcher script) and I need to
>> >> >pass several arguments to it. I've spent all day trying to figure
>> >> >out how to pass arguments to the ExternScriptTaskInstance, but
>> >> >there are no examples of doing this, so I had to wing it. I tried
>> >> >putting the arguments in the task configuration properties. That
>> >> >didn't work. So I tried putting the arguments in the metadata
>> >> >properties, and that hasn't worked. So, your suggestions are
>> >> >welcome!  Thanks so much.  Here's the error log, and the contents of my
>> >> >tasks.xml file follow it at the end.
>> >> >
>> >> >Workflow Manager started PID file
>> >> >(/homes/malldva1/project/jedi/users/jedi-pipeline/oodt-deploy/workflow/run/cas.workflow.pid).
>> >> >Starting OODT File Manager [  Successful  ]
>> >> >Starting OODT Resource Manager [  Failed  ]
>> >> >Starting OODT Workflow Manager [  Successful  ]
>> >> >slothrop:{~/project/jedi/users/jedi-pipeline/oodt-deploy/bin} Oct 06, 2014 5:48:30 PM
>> >> >org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManager loadProperties
>> >> >INFO: Loading Workflow Manager Configuration Properties from:
>> >> >[/homes/malldva1/project/jedi/users/jedi-pipeline/oodt-deploy/workflow/etc/workflow.properties]
>> >> >Oct 06, 2014 5:48:30 PM
>> >> >org.apache.oodt.cas.workflow.engine.ThreadPoolWorkflowEngineFactory getResmgrUrl
>> >> >INFO: No Resource Manager URL provided or malformed URL: executing
>> >> >jobs locally. URL: [null]
>> >> >Oct 06, 2014 5:48:30 PM
>> >> >org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManager <init>
>> >> >INFO: Workflow Manager started by malldva1
>> >> >Oct 06, 2014 5:48:41 PM
>> >> >org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManager handleEvent
>> >> >INFO: WorkflowManager: Received event: startJediPipeline
>> >> >Oct 06, 2014 5:48:41 PM
>> >> >org.apache.oodt.cas.workflow.system.XmlRpcWorkflowManager handleEvent
>> >> >INFO: WorkflowManager: Workflow Jedi Pipeline Workflow retrieved for
>> >> >event startJediPipeline
>> >> >Oct 06, 2014 5:48:41 PM
>> >> >org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread checkTaskRequiredMetadata
>> >> >INFO: Task: [Crawler Task] has no required metadata fields
>> >> >Oct 06, 2014 5:48:42 PM
>> >> >org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread executeTaskLocally
>> >> >INFO: Executing task: [Crawler Task] locally
>> >> >java.lang.NullPointerException
>> >> >        at org.apache.oodt.cas.workflow.examples.ExternScriptTaskInstance.run(ExternScriptTaskInstance.java:72)
>> >> >        at org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread.executeTaskLocally(IterativeWorkflowProcessorThread.java:574)
>> >> >        at org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread.run(IterativeWorkflowProcessorThread.java:321)
>> >> >        at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
>> >> >        at java.lang.Thread.run(Thread.java:745)
>> >> >Oct 06, 2014 5:48:42 PM
>> >> >org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread executeTaskLocally
>> >> >WARNING: Exception executing task: [Crawler Task] locally: Message: null
>> >> >
>> >> >
>> >> >
>> >> >
>> >> ><cas:tasks xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
>> >> ><!--
>> >> >  TODO: Add some examples
>> >> >-->
>> >> >
>> >> >   <task id="urn:oodt:crawlerTask" name="Crawler Task"
>> >> >         class="org.apache.oodt.cas.workflow.examples.ExternScriptTaskInstance"/>
>> >> >      <conditions/>  <!-- There are no pre execution conditions right now -->
>> >> >      <configuration>
>> >> >          <property name="ShellType" value="/bin/sh" />
>> >> >          <property name="PathToScript"
>> >> >value="[OODT_HOME]/crawler/bin/crawler_launcher"/>
>> >> >      </configuration>
>> >> >      <metadata>
>> >> >          <args>
>> >> >             <arg>--operation</arg>
>> >> >                <arg>--launchAutoCrawler</arg>
>> >> >             <arg>--productPath</arg>
>> >> >                <arg>[OODT_HOME]/data/staging</arg>
>> >> >             <arg>--filemgrUrl</arg>
>> >> >                <arg>http://localhost:9000</arg>
>> >> >             <arg>--clientTransferer</arg>
>> >> >                <arg>org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory</arg>
>> >> >             <arg>--mimeExtractorRepo</arg>
>> >> >                <arg>[$OODT_HOME]/extensions/policy/mime-extractor-map.xml</arg>
>> >> >             <arg>--actionIds</arg>
>> >> >                <arg>MoveFileToLevel0Dir</arg>
>> >> >          </args>
>> >> >      </metadata>
>> >> ></cas:tasks>
>> >> >
>> >> >
>> >> >Valerie A. Mallder
>> >> >
>> >> >New Horizons Deputy Mission System Engineer The Johns Hopkins
>> >> >University/Applied Physics Laboratory
>> >> >11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
>> >> >240-228-7846 (Office) 410-504-2233 (Blackberry)
>> >> >
>> >>
>> >
>>

