nutch-user mailing list archives

From Dennis Kubes
Subject Re: focused crawls -- where to add parse filter
Date Mon, 19 Feb 2007 03:12:54 GMT
Brian Whitman wrote:
>> How about an outlink filter that works during parse? In 
>> ParseOutputFormat,
>> it will take the parse text, parse data (etc.) of the source page and
>> the destination url then will either return "filter this outlink" or
>> "let it through".
>> Write an HtmlParseFilter that sets an attribute in the ParseData 
>> MetaData based on whether the page contains what you are looking for. 
>> Then write another MR job that runs after the crawl/index cycle.  This 
>> job would need to update the CrawlDatum MetaData based on your 
>> priority calculation (inlinks and contains text, etc.).  Then hack the 
>> Generator class around line 160 to change the sort value that it is 
>> using based on the CrawlDatum MetaData.  I would make using this new 
>> sort value an option that you can turn on and off by using different 
>> configuration values.
> Hi Doğacan, Dennis:
> Thanks for the ideas. I spent some time mentally planning out how to 
> implement both of these ideas by looking at the source. I'm still newish 
> to Nutch so excuse my naiveté.
> Do either of these approaches let me get at the analyzed/indexed 
> contents of the page text so that I can perform Lucene queries for 
> filtering? What I could tell of the HtmlParseFilter or Parse in general 
> is that it gets me at the parse tree, which I could do regexp queries on 
> -- but I'd rather it be all in Lucene and be influenced by the relative 
> ranking of terms amongst all documents. I am envisioning machine 
> generated queries from our classifiers that might be hundreds of tokens 
> long with boost values per term, and a score threshold. So I'd need to 
> act on the documents post-index. Unless I'm reading your suggestions 
> incorrectly, neither of them let me at that?

You could drop the HtmlParseFilter part and simply write the MR job that 
runs after the crawl/index cycle and updates the CrawlDatum based on your 
Lucene queries.  You would still need to write the second part, which 
changes generation to use a different sort value.
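
To make the sort-value part concrete, here is a rough sketch in plain Java. It does not use any Nutch classes: FocusedSort, sortValue, topN, and the useFocusScore switch are all hypothetical names, and the maps stand in for whatever the post-index job would have written into the CrawlDatum MetaData. Treat it as an illustration of the idea under those assumptions, not as the actual Generator code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the generator's ranking step; none of these names are Nutch APIs.
class FocusedSort {

    // Value to sort on: the focus score when the option is on and a score exists,
    // otherwise the default score.
    static float sortValue(float defaultScore, Float focusScore, boolean useFocusScore) {
        if (useFocusScore && focusScore != null) {
            return focusScore;
        }
        return defaultScore;
    }

    // Orders URLs by sort value, highest first (ties broken by URL), and keeps the top n.
    static List<String> topN(Map<String, Float> defaultScores,
                             Map<String, Float> focusScores,
                             boolean useFocusScore, int n) {
        List<String> urls = new ArrayList<>(defaultScores.keySet());
        urls.sort(Comparator.comparingDouble(
                (String u) -> sortValue(defaultScores.get(u), focusScores.get(u), useFocusScore))
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(urls.subList(0, Math.min(n, urls.size())));
    }
}
```

With the switch off, the ordering is the same as before. With it on, any URL that the post-index job scored is ranked by that score instead, which is the on/off behavior I described earlier.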
> I am currently looking at PruneIndexTool -- could a modification of this 
> work? I could run it after a crawl/index cycle but before invertlinks 
> and the next generate. The one issue I see is that PruneIndexTool claims 
> not to affect the WebDB. Does this mean that even though the lucene doc 
> will be gone, the link and outlinks will remain in the WebDB and will be 
> fetched anyway?

That is correct.  You will need to alter the CrawlDb to affect what is 
generated and hence fetched.
> If I should instead be looking harder at your recommended 
> HtmlParseFilter or ParseOutputFormat, please correct me.

No.  If you are doing complex queries rather than something like "if this 
page contains words x, y, and z", then I wouldn't do it through 
HtmlParseFilter.  I would probably go with the Lucene after-index approach.
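
For comparison, the simple "contains words x, y, and z" case is roughly the check below. This is a sketch in plain Java with hypothetical names (ContainsAllWords, containsAll); it only shows the test itself and assumes the page text is already available as a string, so it is not a real HtmlParseFilter implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; not part of Nutch.
class ContainsAllWords {

    // True if every required word appears as a token in the page text.
    // Tokenizing on \W+ treats non-ASCII letters as separators, so this
    // is only adequate for plain ASCII terms.
    static boolean containsAll(String pageText, List<String> required) {
        Set<String> tokens = new HashSet<>(
                Arrays.asList(pageText.toLowerCase(Locale.ROOT).split("\\W+")));
        for (String word : required) {
            if (!tokens.contains(word.toLowerCase(Locale.ROOT))) {
                return false;
            }
        }
        return true;
    }
}
```

A check like this has no notion of term weights or of ranking relative to other documents, which is why the queries you describe belong after indexing.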

Dennis Kubes
> -Brian
