oodt-dev mailing list archives

From "Verma, Rishi (388J)" <Rishi.Ve...@jpl.nasa.gov>
Subject Registering a custom ProductCrawler with cas-crawler
Date Wed, 25 Apr 2012 23:36:00 GMT
Hi all,

I wrote a custom cas-crawler ProductCrawler, but I'm having some difficulty registering it
with cas-crawler.

I created the product crawler by extending StdProductCrawler, and I've added its name to the
following crawler config files (following the example of StdProductCrawler):
* crawler/policy/crawler-beans.xml
* crawler/policy/cmd-line-option-beans.xml

However, after running the command below, I can see that my custom product crawler (called
LabCASProductCrawler) is not available. An attempted crawler ingest also reports that there is
no "bean" named "LabCASProductCrawler":
> bash-3.2$ ./crawler_launcher --printSupportedCrawlers
  Id: StdProductCrawler
  Id: MetExtractorProductCrawler
  Id: AutoDetectProductCrawler

> ./crawler_launcher --crawlerId LabCASProductCrawler --filemgrUrl http://localhost:9000
--productPath /data/staging/HGHAGA9 --failureDir /tmp/failed_ingest --metFileExtension met
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
Failed to parse options : No bean named 'LabCASProductCrawler' is defined

I noticed that files like crawler-config.xml and cmd-line-option-beans.xml contain references
to crawler config files stored in the cas-crawler JAR. Looking into this further, it seems
to me that the crawler is pre-loading config files directly from that JAR, and that these
override any of my config changes:
* crawler/lib/cas-crawler-0.3.jar:org/apache/oodt/cas/crawl/crawler-beans.xml
* crawler/lib/cas-crawler-0.3.jar:org/apache/oodt/cas/crawl/crawler-config.xml
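
One way to check which copies of these files are visible on the classpath is a small stand-alone Java class. This is only a sketch: the class name ClasspathResourceLister is hypothetical and not part of OODT, it uses only standard JDK calls, and the default resource path is the one listed above. It shows what the class loader can see, not necessarily which file the crawler ends up reading.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic helper (not part of OODT): lists every classpath
// location that provides a given resource name.
public class ClasspathResourceLister {

    // Returns all URLs for the named resource, in the order the class loader reports them.
    public static List<URL> find(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        List<URL> found = new ArrayList<>();
        try {
            Enumeration<URL> urls = loader.getResources(resourceName);
            while (urls.hasMoreElements()) {
                found.add(urls.nextElement());
            }
        } catch (IOException e) {
            // Wrap the checked exception so callers do not have to declare it.
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Default to the resource path seen inside cas-crawler-0.3.jar.
        String name = args.length > 0
                ? args[0]
                : "org/apache/oodt/cas/crawl/crawler-beans.xml";
        List<URL> found = find(name);
        if (found.isEmpty()) {
            System.out.println("Not found on classpath: " + name);
        }
        for (URL url : found) {
            System.out.println(url);
        }
    }
}
```

Run with the same classpath that the crawler_launcher script sets up, it prints one URL per copy found; a jar: URL pointing into cas-crawler-0.3.jar would indicate that the bundled copy is visible to the class loader.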

So two questions:
1. Am I editing the correct policy files in order to register my custom product crawler with
cas-crawler?
2. It seems the cas-crawler JAR contains crawler config files that take precedence over
the ones available for editing under crawler/policy. Is there a way around this?

