oodt-dev mailing list archives

From "Verma, Rishi (388J)" <Rishi.Ve...@jpl.nasa.gov>
Subject Re: Registering a custom ProductCrawler with cas-crawler
Date Fri, 27 Apr 2012 18:37:19 GMT
Hey All,

Chris and I had a lively discussion over IM about whether to write a
custom crawler or use the actionIds/precondId-based extension points.

We thought it would be useful to share, so I've made it available on the
OODT wiki:
https://cwiki.apache.org/confluence/display/OODT/2012/04/27/Custom+crawling+-+when+to+or+when+not+to+write+your+own+ProductCrawler


Thanks!
rishi

On 4/26/12 1:25 PM, "Verma, Rishi (388J)" <Rishi.Verma@jpl.nasa.gov> wrote:

>Per Chris' suggestion, I'm looking at making a custom pre-ingest action or
>pre-ingest comparator instead of creating a full new ProductCrawler. This
>might be a more lightweight solution.
>
>However, thanks for the tips in any case Brian and Chris!
>
>rishi
>
>On 4/26/12 2:06 AM, "Brian Foster" <holenoter@me.com> wrote:
>
>>Nevermind... Looks like you are using 0.3 instead of the trunk... what I
>>added applies to trunk crawler
>>
>>-Brian
>>
>>On Apr 25, 2012, at 4:36 PM, "Verma, Rishi (388J)"
>><Rishi.Verma@jpl.nasa.gov> wrote:
>>
>>> Hi all,
>>> 
>>> I wrote a custom cas-crawler ProductCrawler, but I'm having some
>>>difficulty registering my custom product crawler with cas-crawler.
>>> 
>>> I created a product crawler by extending StdProductCrawler, and I've
>>>added this product-crawler name to crawler config files (following the
>>>example of StdProductCrawler):
>>> * crawler/policy/crawler-beans.xml
>>> * crawler/policy/cmd-line-option-beans.xml
>>> 
>>> However, after running the command below, I can clearly see that my
>>>custom product crawler (called LabCASProductCrawler) is not available.
>>>An attempted crawler ingest also tells me that there is no "bean" named
>>>"LabCASProductCrawler" available:
>>>> bash-3.2$ ./crawler_launcher --printSupportedCrawlers
>>> ProductCrawlers:
>>>  Id: StdProductCrawler
>>>  Id: MetExtractorProductCrawler
>>>  Id: AutoDetectProductCrawler
>>> 
>>>> ./crawler_launcher --crawlerId LabCASProductCrawler --filemgrUrl
>>>>http://localhost:9000 --productPath /data/staging/HGHAGA9 --failureDir
>>>>/tmp/failed_ingest --metFileExtension met --clientTransferer
>>>>org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
>>> Failed to parse options : No bean named 'LabCASProductCrawler' is
>>>defined
>>> 
>>> I noticed in files like crawler-config.xml and
>>>cmd-line-option-beans.xml, there were references made to crawler config
>>>files stored in the cas-crawler JAR. Looking more into this, it seems to
>>>me that crawler is pre-loading config files directly from that JAR and
>>>overshadowing any of my config changes:
>>> * crawler/lib/cas-crawler-0.3.jar:org/apache/oodt/cas/crawl/crawler-beans.xml
>>> * crawler/lib/cas-crawler-0.3.jar:org/apache/oodt/cas/crawl/crawler-config.xml
>>> 
>>> So two questions:
>>> 1. Am I editing the correct policy files, in order to register my
>>>custom product crawler with cas-crawler?
>>> 2. It seems the cas-crawler JAR contains crawler config files that take
>>>greater precedence than the ones available for editing under
>>>crawler/policy. Is there a way around this?
>>> 
>>> Thanks!
>>> rishi
>
