oodt-dev mailing list archives

From Sheryl John <shery...@gmail.com>
Subject Re: Problem happened when I tried to run the script "crawler_launcher"
Date Fri, 10 Aug 2012 08:26:11 GMT
Hi Yunhee,


On Thu, Aug 9, 2012 at 8:19 PM, YunHee Kang <yunh.kang@gmail.com> wrote:

> Hi Sheryl,
>
> First off, I tried to run crawler_launcher with an option "-autoPC".
> Then I got a warning message as follows:
> Aug 10, 2012 11:12:26 AM org.apache.oodt.cas.crawl.ProductCrawler
> handleFile
> WARNING: Failed to pass preconditions for ingest of product:
>
> [/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5]
> Aug 10, 2012 11:12:26 AM org.apache.oodt.cas.crawl.ProductCrawler
> handleFile
> INFO: Handling file
>
> /home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5.info.tmp
> Aug 10, 2012 11:12:26 AM org.apache.oodt.cas.crawl.ProductCrawler
> handleFile
> WARNING: Failed to pass preconditions for ingest of product:
>
> [/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5.info.tmp]
>
> I think that the warning message is related with preconditions for ingest.
> According to the run script for crawler_launcher,  it was wrong to
> describe the option "pids" for the preconditions.
> #!/bin/sh
> export STAGE_AREA=/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2
> ./crawler_launcher \
>       -op   -stdPC \
>       -mfx tmp\
>       --productPath $STAGE_AREA\
>       --filemgrUrl http://localhost:8000\
>        --failureDir /tmp \
>        --actionIds DeleteDataFile MoveDataFileToFailureDir Unique \
>        --metFileExtension tmp \
>        -pids CheckThatDataFileSizeIsGreaterThanZero \
>        --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
> Let me know how to fix the warning.
>
>
I see that your data file is *.he5 and the metadata file is *.he5.info.tmp.
Specify your '-mfx' option as 'info.tmp', because StdProductCrawler appends
the met file extension to the absolute path of the data file to locate the
met file. Try that and see if it ingests the data file. I should have
noticed this before, but I only caught it after testing it out.
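
To make the naming rule concrete, here is a rough sketch in plain shell. The
helper met_path_for is hypothetical (it is not part of OODT), and it assumes
the crawler joins the data file path and the met extension with a single dot,
which is what your file names suggest:

```shell
#!/bin/sh
# Hypothetical helper, not part of OODT: prints the met file path that the
# crawler is assumed to look for, i.e. <data file path>.<met extension>.
met_path_for() {
    data_file=$1
    met_ext=$2
    printf '%s.%s\n' "$data_file" "$met_ext"
}

# With "-mfx tmp" the expected met file would end in .he5.tmp.
met_path_for TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5 tmp

# With "-mfx info.tmp" it would end in .he5.info.tmp, matching your staging area.
met_path_for TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5 info.tmp
```

If the second path exists next to the data file in your staging directory, the
precondition warning for the .he5 file should hopefully go away.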

> Next I applied an option for the metadata crawler to the run script.
> #!/bin/sh
> export STAGE_AREA=/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2
> ./crawler_launcher \
>        -op    -metPC\
>        -pp $STAGE_AREA\
>        -fm http://localhost:8000\
>        -mxc ../policy/crawler-config.xml\
>        -mx org.apache.oodt.cas.metadata.extractors.ExternMetExtractor\
>        -mxr ../policy/mime-extractor-map.xml\
>        --failureDir /tmp \
>        --actionIds DeleteDataFile MoveDataFileToFailureDir Unique \
>        --metFileExtension tmp \
>        --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
>
> I also get the error message as follows:
>
> ERROR: Failed to launch crawler : Error creating bean with name
> 'MetExtractorProductCrawler' defined in file
>
> [/home/yhkang/oodt-0.5/cas-crawler-0.5-SNAPSHOT/bin/../policy/crawler-beans.xml]:
> Error setting property values; nested exception is
> org.springframework.beans.PropertyBatchUpdateException; nested
> PropertyAccessExceptions (1) are:
> PropertyAccessException 1:
> org.springframework.beans.MethodInvocationException: Property
> 'metExtractor' threw exception; nested exception is
> org.apache.oodt.cas.metadata.exceptions.MetExtractionException: Failed
> to parse config file : Failed to parser
> '/home/yhkang/oodt-0.5/cas-crawler-0.5-SNAPSHOT/policy/crawler-config.xml'
> : null
>
> I just used the property file crawler-config.xml (as follows) in the
> policy directory.
>
> <beans xmlns="http://www.springframework.org/schema/beans"
>         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xmlns:p="http://www.springframework.org/schema/p"
>         xsi:schemaLocation="http://www.springframework.org/schema/beans
> http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">
>         <bean
> class="org.apache.oodt.cas.crawl.util.CasPropertyOverrideConfigurer"
> />
>         <import resource="crawler-beans.xml" />
>         <import resource="action-beans.xml" />
>         <import resource="precondition-beans.xml" />
>         <import resource="naming-beans.xml" />
> </beans>
>
>

Your met extractor config (the -mxc option) should be a config file for your
external met extractor, and will look like this:
https://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/resources/examples/extern-config.xml

The crawler-config.xml, on the other hand, is what crawler_launcher itself
reads to load all the actions, preconditions, etc. It is not an extractor
config, which is why the extractor failed to parse it.

I've not defined or used an external met extractor before, but you can see
an example of an external met extractor and its config in the wiki:
https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
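
As a quick sanity check before launching, something like the following could
tell the two kinds of file apart. This is only a sketch: the function names
are made up, and the only thing it relies on is that a Spring file such as
crawler-config.xml has a <beans> root element, as in the file you pasted.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file contains a <beans> element,
# i.e. it looks like a Spring beans file rather than an extractor config.
looks_like_spring_beans() {
    grep -Eq '<beans([[:space:]>]|$)' "$1"
}

# Hypothetical wrapper: prints a hint about whether the file is a plausible
# argument for -mxc.
check_mxc() {
    if looks_like_spring_beans "$1"; then
        echo "$1: Spring beans file, probably not what -mxc expects"
    else
        echo "$1: not a Spring beans file, may be an extractor config"
    fi
}

# Path taken from the run script in this thread; override with MXC_FILE.
MXC_FILE=${MXC_FILE:-../policy/crawler-config.xml}
if [ -f "$MXC_FILE" ]; then
    check_mxc "$MXC_FILE"
fi
```

It does not validate the extractor config itself; it only catches the mix-up
of passing the Spring file to -mxc.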

> So I need to understand how to write some XML files (including
> crawler-beans.xml, action-beans.xml, etc), which are imported into the
> file  crawler-config.xml .
> Could you share your experience with me ?
> Thanks,
> Yunhee
>
>
Yep, you should write the above-mentioned extractor config file for your
specific external met extractor. But you don't have to write crawler-beans
or action-beans. You can just pick the action IDs you want with the
crawler_launcher CLI option '--actionIds' (or '-ais'); the available IDs are
listed in action-beans.xml. The same applies to crawler-beans and the
preconditions.
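
If you want to see which IDs are available without opening the files by hand,
a small sketch like this may help. The function list_bean_ids is hypothetical,
and it assumes each bean is declared with <bean and id="..." on the same line,
which may not hold for every policy file:

```shell
#!/bin/sh
# Hypothetical helper: prints the id attribute of every <bean ...> element
# that has its id="..." on the same line as the opening <bean tag.
list_bean_ids() {
    sed -n 's/.*<bean[^>]* id="\([^"]*\)".*/\1/p' "$1"
}

# Policy directory relative to the crawler bin directory, as in this thread;
# override with POLICY_DIR if yours is elsewhere.
POLICY_DIR=${POLICY_DIR:-../policy}
for f in action-beans.xml precondition-beans.xml crawler-beans.xml; do
    if [ -f "$POLICY_DIR/$f" ]; then
        echo "== $f"
        list_bean_ids "$POLICY_DIR/$f"
    fi
done
```

The IDs printed for action-beans.xml are the ones you can pass to the action
IDs option, and those from precondition-beans.xml to '-pids'.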

2012/8/10 Sheryl John <sheryljj@gmail.com>:
> > Hi Yunhee,
> >
> > What are the error messages you get while running the crawler?
> >
> > I've faced similar issues with crawler when I tried out the first time too.
> > I went through the crawler user guide to understand the architecture and
> > then understood how it worked only after running crawler with several times
> > to ingest files.
> > I agree we need to update the guide and if you want to know about the
> > MetExtractorProductCrawler and AutoDetectProductCrawler, the wiki page that
> > I mentioned before will give you an idea how to get it working (It mentions
> > the config files that you need to write for the above two crawlers).
> >
> >
> >
> > On Thu, Aug 9, 2012 at 6:27 AM, YunHee Kang <yunh.kang@gmail.com> wrote:
> >
> >> Hi Chris,
> >>
> >> I got a bunch of error messages when running the crawler_launcher script.
> >> First off, I think I need to understand  how to a crawler works.
> >> Can I get some materials to help me write configuration files for
> >> crawler_launcher ?
> >>
> >> Honestly I am not familiar with Crawler.
> >> But I will try to file a JIRA issue to update the Crawler user guide.
> >>
> >> Thanks,
> >> Yunhee
> >>
> >>
> >>
> >> 2012/8/9 Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov>:
> >> > Hi YunHee,
> >> >
> >> > Sorry, we need to update the docs, that is for sure. Can you help
> >> > us remember by filing a JIRA issue to update the Crawler user
> >> > guide and to fix the URL there?
> >> >
> >> > As for crawlerId, yes it's obsolete, you can find the modern
> >> > 0.4 and 0.5-trunk options by running ./crawler_launcher -h
> >> >
> >> > Cheers,
> >> > Chris
> >> >
> >> > On Aug 7, 2012, at 7:03 AM, YunHee Kang wrote:
> >> >
> >> >> Hi Chris and Sheryl,
> >> >>
> >> >> I understood  my mistake after modifying a wrong URL with the "/".
> >> >> But there is the wrong  URL  that is used  as an option of
> >> >> crawler_launcher in the apache oodt
> >> >> homepage(http://oodt.apache.org/components/maven/crawler/user/).
> >> >> --filemgrUrl http://localhost:9000/ \
> >> >> So it made me confused.
> >> >>
> >> >> I tried to run the command mentioned below  according to  the home
> >> >> page of apache oodt.
> >> >> $ ./crawler_launcher --crawlerId MetExtractorProductCrawler
> >> >> ERROR: Invalid option: 'crawlerId'
> >> >>
> >> >> But the error described above  was occurred.
> >> >> Is the option 'crawlerid'  obsolete ?
> >> >>
> >> >> Thanks,
> >> >> Yunhee
> >> >>
> >> >>
> >> >> 2012/8/7 Mattmann, Chris A (388J) <chris.a.mattmann@jpl.nasa.gov>:
> >> >>> Perfect, Sheryl, my thoughts exactly.
> >> >>>
> >> >>> Cheers,
> >> >>> Chris
> >> >>>
> >> >>> On Aug 6, 2012, at 10:01 AM, Sheryl John wrote:
> >> >>>
> >> >>>> Hi Yunhee,
> >> >>>>
> >> >>>> Check out this OODT wiki for crawler :
> >> >>>> https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
> >> >>>>
> >> >>>> Did you try giving 'http://localhost:8000' without the "/" in the
> >> >>>> end?
> >> >>>> Also, specify
> >> >>>> 'org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory'
> >> >>>> for  'clientTransferer' option.
> >> >>>>
> >> >>>>
> >> >>>> On Mon, Aug 6, 2012 at 9:46 AM, YunHee Kang <yunh.kang@gmail.com> wrote:
> >> >>>>
> >> >>>>> Hi Chris,
> >> >>>>>
> >> >>>>> I got an error message when I tried to run crawler_launcher by using a
> >> >>>>> shell script. The error message may be caused by a  wrong URL of
> >> >>>>> filemgr.
> >> >>>>> $ ./crawler_launcher.sh
> >> >>>>> ERROR: Validation Failures: - Value 'http://localhost:8000/' is not
> >> >>>>> allowed for option
> >> >>>>> [longOption='filemgrUrl',shortOption='fm',description='File Manager
> >> >>>>> URL'] - Allowed values = [http://.*:\d*]
> >> >>>>>
> >> >>>>> The following is the shell script that I wrote:
> >> >>>>> $ cat crawler_launcher.sh
> >> >>>>> #!/bin/sh
> >> >>>>> export STAGE_AREA=/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2
> >> >>>>> ./crawler_launcher \
> >> >>>>>      -op --launchStdCrawler \
> >> >>>>>      --productPath $STAGE_AREA\
> >> >>>>>      --filemgrUrl http://localhost:8000/\
> >> >>>>>      --failureDir /tmp \
> >> >>>>>      --actionIds DeleteDataFile MoveDataFileToFailureDir Unique \
> >> >>>>>      --metFileExtension tmp \
> >> >>>>>      --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferer
> >> >>>>>
> >> >>>>> I am wondering if there is a problem in the URL of the filemgr or
> >> >>>>> elsewhere
> >> >>>>>
> >> >>>>> Thanks,
> >> >>>>> Yunhee
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> -Sheryl
> >> >>>
> >> >>>
> >> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>> Chris Mattmann, Ph.D.
> >> >>> Senior Computer Scientist
> >> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> >>> Office: 171-266B, Mailstop: 171-246
> >> >>> Email: chris.a.mattmann@nasa.gov
> >> >>> WWW:   http://sunset.usc.edu/~mattmann/
> >> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>> Adjunct Assistant Professor, Computer Science Department
> >> >>> University of Southern California, Los Angeles, CA 90089 USA
> >> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>>
> >> >
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > -Sheryl
>



-- 
-Sheryl
