oodt-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: what is batch stub? Is it necessary?
Date Thu, 09 Oct 2014 21:25:07 GMT
Can you do a ls -al of your /lib directory please?
Also can you please provide any relevant snippet of your pom.xml which
contains filemgr pom.xml dependency
Thank you
Lewis

On Thu, Oct 9, 2014 at 2:20 PM, Mallder, Valerie <Valerie.Mallder@jhuapl.edu
> wrote:

> Thanks Chris,  (Thanks everyone for all of the help, it was helpful,
> really it was :)  )
>
> My brain is exhausted ..... (heavy sigh) and I feel like I have to start
> all over again.
>
> My intention (after I got the crawler and filemanager working together
> last week), was to integrate it with the workflow manager to demonstrate
> launching a workflow that consisted of a simple script that runs before the
> crawler, and then run the crawler.  After that, I was going to try to
> integrate a java application into the workflow, and try to continue
> integrating new things step by step. I think everything would have been
> fine in this simple setup if I could have just gotten the
> ExternScriptTaskInstance to run. But that was a huge fail.  It doesn't look
> like the test program for that class tests it the way I want to use it, so
> I have no idea if it actually works or not.  The code implies that you can
> specify arguments to your external script, but I could not find a way to
> get them read in.  The the getAllMetada method always returned a null list
> of arguments that causes an exception on line 72.
>
> So right now, I've basically gone back to the beginning of using CAS-PGE,
> and I'm trying to get the crawler to run as the very first step in my
> pipeline the raw telemetry files that are dropped off by FEI.  After the
> ingestion and archival, one of the postIngestSuccess actions of the crawler
> copies all of the new raw telemetry files to a directory where we store all
> of the level 0 files.  The level 0 directory (and all of it's
> subdirectories and files) is what I consider to be the "output" of this
> simple first step of the pipeline.  I realize that I may need to start a
> crawler again at a later point in the pipeline. But I want to focus on one
> step at a time.
>
> Chris, In regards to your comments below, here are two questions followed
> by the contents of my .xml files for review.
>
> [1]- When you say " define blocks in the <output>..</output> section of
> the XML file", what xml file are you referring to? I think the
> <output>..</output> tags can only go in the PGE config file, is that
> correct?
>
> Here is what I have in my fei-crawler-pge-config.xml file. Is this OK?
>    <!-- Files to ingest -->
>    <output>
>       <dir="[FEI_DROP_DIR]" envReplace="true" />
>    </output>
>
> [2] If I don't need to define a CAS-PGE Task, how do I tell the workflow
> to start the crawler?   Right now, I am trying to do it with a task, but if
> you can tell me how to do it without a task, I will be happy to try it.
>
>
>
> So, here is my current workflow:
>
> <cas:workflow xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"
> id="urn:oodt:jediWorkflowId" name="jediWorkflowName">
>    <tasks>
>        <task id="urn:oodt:feiCrawlerTaskId" name="feiCrawlerTaskName" />
>    </tasks>
> </cas:workflow>
>
> Here is my current task:
>
> <cas:tasks xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
>    <task id="urn:oodt:feiCrawlerTaskId" name="feiCrawlerTaskName"
> class="org.apache.oodt.cas.pge.StdPGETaskInstance">
>       <configuration>
>          <property name="PGETask/Name" value="feiCrawlerTaskname"/>
>          <property name="PGETask/ConfigFilePath"
> value="[OODT_HOME]/extensions/config/fei-crawler-pge-config.xml"
> envReplace="true"/>
>          <property name="PGETask/DumpMetadata" value="true"/>
>          <property name="PGETask/WorkflowManagerUrl"
> value="[WORKFLOW_URL]" envReplace="true" />
>          <property name="PGETask/Query/FileManagerUrl"
>  value="[FILEMGR_URL]" envReplace="true"/>
>          <property name="PGETask/Ingest/FileManagerUrl"
>  value="[FILEMGR_URL]" envReplace="true"/>
>
>          <property name="PGETask/Query/ClientTransferServiceFactory"
> value="org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory"/>
>          <property name="PGETask/Ingest/CrawlerConfigFile"
> value="file:[CRAWLER_HOME]/policy/crawler-config.xml" envReplace="true"/>
>          <property name="PGETask/Ingest/MimeExtractorRepo"
> value="file:[OODT_HOME]/extensions/policy/mime-extractor-map.xml"
> envReplace="true"/>
>          <property name="PGETask/Ingest/ActionIds"
> value="MoveFileToLevel0Dir" envReplace="true"/>
>          <property name="PGE_HOME" value="[PGE_HOME]" envReplace="true"/>
>       </configuration>
>    </task>
> </cas:tasks>
>
> And, here is my current PGE config - fei-crawler-pge-config.xml
>
> <pgeConfig>
>    <!-- How to run the PGE -->
>    <exe dir="[OODT_HOME]">
>       <cmd>mkdir [JobDir]</cmd>
>    </exe>
>
>    <!-- Files to ingest -->
>    <output>
>       <dir="[FEI_DROP_DIR]" envReplace="true" />
>    </output>
>
> <!-- Custom metadata to add to output files -->
>    <customMetadata>
>      <metadata key="JobDir" value="[OODT_HOME]/data/pge/jobs" />
>    </customMetadata>
> </pgeConfig>
>
>
>
> With the settings in these settings I do not get to the point where the
> first command in the PGE config gets executed. The data/pge/jobs directory
> does not get created.  However, the workflow starts and the task gets
> submitted to the resource manager, and a new thread called "Thread-2" gets
> spawned. But, "Thread-2" gets an exception and that's it.  I thought maybe
> it was due to the fact that the filemgr jar is not in the resmgr/lib
> direcory when you do the radix install. So, I copied the filemgr jar file
> to resmgr/lib and ran again, but still get the same the exception.  And,
> the filemgr IS running, and I shut down all of the filemgr, workfow mgr,
> resmgr and batch_stub each time I run so that every run starts with new
> processes every time.
>
> If anyone has any recommendations on a better way to do this please let me
> know.
>
> Thanks,
> Val
>
>
>
>
>
>
> INFO: Task: [feiCrawlerTaskName] has no required metadata fields
> Exception in thread "Thread-2" java.lang.NoClassDefFoundError:
> org/apache/oodt/cas/filemgr/metadata/CoreMetKeys
>         at java.lang.ClassLoader.defineClass1(Native Method)
>         at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>         at
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>         at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         at java.lang.ClassLoader.defineClass1(Native Method)
>         at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>         at
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>         at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>         at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:190)
>         at
> org.apache.oodt.cas.workflow.util.GenericWorkflowObjectFactory.getTaskObjectFromClassName(GenericWorkflowObjectFactory.java:169)
>         at
> org.apache.oodt.cas.workflow.engine.IterativeWorkflowProcessorThread.run(IterativeWorkflowProcessorThread.java:222)
>         at
> EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.oodt.cas.filemgr.metadata.CoreMetKeys
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>         at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>         ... 28 more
>
>
>
> Valerie A. Mallder
> New Horizons Deputy Mission System Engineer
> Johns Hopkins University/Applied Physics Laboratory
>
>
> > -----Original Message-----
> > From: Chris Mattmann [mailto:chris.mattmann@gmail.com]
> > Sent: Wednesday, October 08, 2014 2:52 PM
> > To: dev@oodt.apache.org
> > Subject: Re: what is batch stub? Is it necessary?
> >
> > Hi Val,
> >
> > I don?t think you need to run a CAS-PGE task to call crawler_launcher.
> If you
> > define blocks in the <output>..</output> section of the XML file, a
> crawler will be
> > forked in the job working directory of CAS-PGE and crawl your specified
> output.
> >
> > I believe that will accomplish the same goal of what you are looking for.
> >
> > No need to have crawling be a separate task from CAS-PGE - CAS-PGE will
> do
> > the crawling for you! :)
> >
> > Cheers,
> > Chris
> >
> > ------------------------
> > Chris Mattmann
> > chris.mattmann@gmail.com
> >
> >
> >
> >
> > -----Original Message-----
> > From: "Verma, Rishi (398J)" <Rishi.Verma@jpl.nasa.gov>
> > Reply-To: <dev@oodt.apache.org>
> > Date: Thursday, October 9, 2014 at 2:44 AM
> > To: "dev@oodt.apache.org" <dev@oodt.apache.org>
> > Subject: Re: what is batch stub? Is it necessary?
> >
> > >Hi Val,
> > >
> > >Yep - here?s a link to the tasks.xml file:
> > >https://github.com/riverma/xdata-jpl-netscan/blob/master/oodt-netscan/w
> > >ork flow/src/main/resources/policy/tasks.xml
> > >
> > >> The problem is that the ExternScriptTaskInstance is unable to
> > >>recognize the command line arguments that I want to pass to the
> > >>crawler_launcher script.
> > >
> > >
> > >Hmm.. could you share your workflow manager log, or better yet, the
> > >batch_stub output? Curious to see what error is thrown.
> > >
> > >Is a script file being generated for your PGE? For example, inside your
> > >[PGE_HOME] directory, and within the particular job directory created
> > >for your execution of a workflow, you will see some files starting with
> > >?sciPgeExeScript_??. You?ll find one for your pgeConfig, and you can
> > >check to see what the PGE commands actually translate into, with
> > >respect to a shell script format. If that file is there, take a look at
> > >it, and validate whether the command works within the script (i.e.
> > >copy/paste and run the crawler command manually).
> > >
> > >Another suggestion is to take a step back, and build up slowly, i.e.:
> > >1. Do an ?echo? command within your PGE first. (e.g. <cmd> echo ?Hello
> > >APL.? > /tmp/test.txt</cmd>) 2. If above works, do a crawler_launcher
> > >empty command(e.g.
> > ><cmd>/path/to/oodt/crawler/bin/crawler_launcher</cmd>) and verify
the
> > >batch_stub or Workflow Manager prints some kind of output when you run
> > >the workflow.
> > >3. Build up your crawler_launcher command piece by piece to see where
> > >it is failing
> > >
> > >Thanks,
> > >Rishi
> > >
> > >On Oct 8, 2014, at 4:24 PM, Mallder, Valerie
> > ><Valerie.Mallder@jhuapl.edu>
> > >wrote:
> > >
> > >> Hi Rishi,
> > >>
> > >> Thank you very much for pointing me to your working example. This is
> > >>very helpful.  My pgeConfig looks very similar to yours.  So, I
> > >>commented out the resource manager like you suggested and tried
> > >>running again without the resource manager. And my problem still
> > >>exists. The problem is that the ExternScriptTaskInstance is unable to
> > >>recognize the command line arguments that I want to pass to the
> > >>crawler_launcher script. Could you send me a link to your tasks.xml
> > >>file? I'm curious as to how you defined your task.  My pgeConfig and
> tasks.xml
> > are below.
> > >>
> > >> Thanks!
> > >> Val
> > >>
> > >>
> > >> <?xml version="1.0" encoding="UTF-8"?> <pgeConfig>
> > >>
> > >>   <!-- How to run the PGE -->
> > >>   <exe dir="[JobDir]" shell="/bin/sh" envReplace="true">
> > >>        <cmd>[CRAWLER_HOME]/bin/crawler_launcher --operation
> > >>--launchAutoCrawler \
> > >>        --filemgrUrl [FILEMGR_URL] \
> > >>        --clientTransferer
> > >>org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
> > >>        --productPath [JobInputDir] \
> > >>        --mimeExtractorRepo
> > >>[OODT_HOME]/extensions/policy/mime-extractor-map.xml \
> > >>        --actionIds MoveFileToLevel0Dir</cmd>
> > >>   </exe>
> > >>
> > >>   <!-- Files to ingest -->
> > >>   <output/>
> > >>   </output>
> > >>
> > >> <!-- Custom metadata to add to output files -->
> > >>   <customMetadata>
> > >>      <metadata key="JobDir" val="[OODT_HOME]"/>
> > >>      <metadata key="JobInputDir" val="[FEI_DROP_DIR]"/>
> > >>      <metadata key="JobOutputDir" val="[JobDir]/data/pge/jobs"/>
> > >>      <metadata key="JobLogDir" val="[JobDir]/data/pge/logs"/>
> > >>   </customMetadata>
> > >>
> > >> </pgeConfig>
> > >>
> > >>
> > >>
> > >> <!-- tasks.xml **************************************************-->
> > >>
> > >> <cas:tasks xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">
> > >>
> > >>   <task id="urn:oodt:crawlerLauncherId" name="crawlerLauncherName"
> > >>class="org.apache.oodt.cas.workflow.examples.ExternScriptTaskInstance">
> > >>      <conditions/>  <!-- There are no pre execution conditions
right
> > >>now -->
> > >>      <configuration>
> > >>
> > >>          <property name="ShellType" value="/bin/sh" />
> > >>          <property name="PathToScript"
> > >>value="[CRAWLER_HOME]/bin/crawler_launcher" envReplace="true" />
> > >>
> > >>          <property name="PGETask_Name" value="crawler_launcher PGE
> > >>Task"/>
> > >>          <property name="PGETask_ConfigFilePath"
> > >>value="[OODT_HOME]/extensions/config/crawler-pge-config.xml"
> > >>envReplace="true" />
> > >>      </configuration>
> > >>   </task>
> > >>
> > >> </cas:tasks>
> > >>
> > >> Valerie A. Mallder
> > >> New Horizons Deputy Mission System Engineer Johns Hopkins
> > >> University/Applied Physics Laboratory
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: Verma, Rishi (398J) [mailto:Rishi.Verma@jpl.nasa.gov]
> > >>> Sent: Wednesday, October 08, 2014 6:01 PM
> > >>> To: dev@oodt.apache.org
> > >>> Subject: Re: what is batch stub? Is it necessary?
> > >>>
> > >>> Hi Valerie,
> > >>>
> > >>>>>>> All I am trying to do is run "crawler_launcher" as
a workflow
> > >>>>>>> task in the CAS PGE environment.
> > >>>
> > >>> Interesting. I have a working example here [1] you can look at that
> > >>>does this exact  thing.
> > >>>
> > >>>>>>> So, if "batchstub" is necessary in this scenario, pleast
tell me
> > >>>>>>> what it is, why it is necessary, and how to run it
(please
> > >>>>>>> provide exact syntax to put in my startup shell script,
because
> > >>>>>>> I would never be able to figure it out for myself and
I don't
> > >>>>>>> want to have to bother everyone again.)
> > >>>
> > >>> Batchstub is only necessary if your Workflow Manger is sending jobs
> > >>>to Resource  Manager for execution (where the default execution is to
> > >>>run the job in something  called a ?batch stub? executable). Think of
> > >>>batch stubs as a small wrapper  program that takes a bundle of
> > >>>executable instructions from Resource Manager,  and executes them in
> > >>>a shell environment within a given remote (or
> > >>>local) machine.
> > >>>
> > >>> Here?s my suggestion:
> > >>> 1. Like Paul suggested, go to $OODT_HOME/resmgr/bin, and execute the
> > >>>following command (it?ll start a batch stub in a terminal on port
> > >>>2001):
> > >>>> ./batch_stub 2001
> > >>>
> > >>> If the above step doesn?t fix your problem, you can also try having
> > >>>Workflow  Manager NOT send jobs to Resource Manager for execution,
> > >>>and instead execute  jobs locally through Workflow Manager itself (on
> > >>>localhost only!). To disable job  transfer to Resource Manger, you?ll
> > >>>need to modify the Workflow Manager  properties file
> > >>>($OODT_HOME/wmgr/etc/workflow.properties), and specifically  comment
> > >>>out the ?org.apache.oodt.cas.workflow.engine.resourcemgr.url?
> > >>>line.
> > >>> I?ve done this in my example code below, see [2] for an exact
> > >>>example of this.
> > >>> After modifying workflow.properties, make sure to restart workflow
> > >>>manager
> > >>> ($OODT_HOME/wmgr/bin/wmgr stop   followed by
> > $OODT_HOME/wmgr/bin/wmgr
> > >>> start).
> > >>>
> > >>> Thanks,
> > >>> Rishi
> > >>>
> > >>> [1] https://github.com/riverma/xdata-jpl-netscan/blob/master/oodt-
> > >>>
> > >>>netscan/pge/src/main/resources/policy/netscan-getipv4entriesrandomsam
> > >>>ple
> > >>>.xml
> > >>> [2] https://github.com/riverma/xdata-jpl-netscan/blob/master/oodt-
> > >>> netscan/workflow/src/main/resources/etc/workflow.properties
> > >>>
> > >>> On Oct 8, 2014, at 2:31 PM, Ramirez, Paul M (398J)
> > >>> <paul.m.ramirez@jpl.nasa.gov> wrote:
> > >>>
> > >>>> Valerie,
> > >>>>
> > >>>> I would have thought it would have just not used a batch stub by
> > >>>>default. That
> > >>> said if you go into the $OODT_HOME/resmgr/bin there should be a
> > >>>script to start a  batch stub. Right now on my phone I forget the
> > >>>name of the script but if you more  the file you will see the Java
> > >>>class name that corresponds to below.
> > >>>You should
> > >>> specify a port when you run the script which from the looks of the
> > >>>output below  should be 2001.
> > >>>>
> > >>>> HTH,
> > >>>> Paul R
> > >>>>
> > >>>> Sent from my iPhone
> > >>>>
> > >>>>> On Oct 8, 2014, at 2:04 PM, Mallder, Valerie
> > >>>>><Valerie.Mallder@jhuapl.edu>
> > >>> wrote:
> > >>>>>
> > >>>>> Well then, I'm proud to be a member :)  (I think .... )
> > >>>>>
> > >>>>>
> > >>>>> Valerie A. Mallder
> > >>>>> New Horizons Deputy Mission System Engineer Johns Hopkins
> > >>>>> University/Applied Physics Laboratory
> > >>>>>
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Bruce Barkstrom [mailto:brbarkstrom@gmail.com]
> > >>>>>> Sent: Wednesday, October 08, 2014 4:54 PM
> > >>>>>> To: dev@oodt.apache.org
> > >>>>>> Subject: Re: what is batch stub? Is it necessary?
> > >>>>>>
> > >>>>>> You have every right to bother everyone.
> > >>>>>> You won't get what you need unless you do.
> > >>>>>>
> > >>>>>> You get one honorary membership in the Society of General
> > >>>>>> Agitators
> > >>>>>> - at the rank of Major Agitator.
> > >>>>>>
> > >>>>>> Bruce B.
> > >>>>>>
> > >>>>>> On Wed, Oct 8, 2014 at 4:49 PM, Mallder, Valerie
> > >>>>>> <Valerie.Mallder@jhuapl.edu
> > >>>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hello,
> > >>>>>>>
> > >>>>>>> I am still having trouble getting my CAS PGE crawler
task to run
> > >>>>>>>due to
> > >>>>>>> http://localhost:2001 being "down". I have spent the
last 2 days
> > >>>>>>>tracing through the resource manager code and tracked
this down
> > >>>>>>>to  line 146 of LRUScheduler where the XmlRpcBatchMgr
is failing
> > >>>>>>>to  execute the task remotely, because on line 75 of
> > >>>>>>>XmlRpcBatchMgrProxy (that was instantiated by XmlRpcBatchMgr
on
> > >>>>>>>its  line 74) is trying to call "isAlive" on the webservice
named
> > >>>>>>>"batchstub" which, to my knowledge, is not running because
I have
> > >>>>>>>not done
> > >>> anything explicitly to run it.
> > >>>>>>>
> > >>>>>>> All I am trying to do is run "crawler_launcher" as
a workflow
> > >>>>>>> task in the CAS PGE environment.  I had it running
perfectly
> > >>>>>>> before I started trying to make it run as part of a
workflow.  I
> > >>>>>>> really miss my crawler and really want it to run again
L
> > >>>>>>>
> > >>>>>>> So, if "batchstub" is necessary in this scenario, pleast
tell me
> > >>>>>>> what it is, why it is necessary, and how to run it
(please
> > >>>>>>> provide exact syntax to put in my startup shell script,
because
> > >>>>>>> I would never be able to figure it out for myself and
I don't
> > >>>>>>> want to have to bother everyone again.)
> > >>>>>>>
> > >>>>>>> Thanks so much!
> > >>>>>>>
> > >>>>>>> Val
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Valerie A. Mallder
> > >>>>>>>
> > >>>>>>> New Horizons Deputy Mission System Engineer The Johns
Hopkins
> > >>>>>>> University/Applied Physics Laboratory
> > >>>>>>> 11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
> > >>>>>>> 240-228-7846 (Office) 410-504-2233 (Blackberry)
> > >>>>>>>
> > >>>>>>>
> > >>>
> > >>> ---
> > >>> Rishi Verma
> > >>> NASA Jet Propulsion Laboratory
> > >>> California Institute of Technology
> > >>
> > >
> > >---
> > >Rishi Verma
> > >NASA Jet Propulsion Laboratory
> > >California Institute of Technology
> > >
> >
>
>


-- 
*Lewis*

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message