nutch-agent mailing list archives

From John Whelan <j...@whelanlabs.com>
Subject Re: url filters
Date Sat, 28 Mar 2009 21:42:20 GMT

Filtering is one solution: set your filter criteria to match only your
pages. Another approach is to limit the traversal depth so that only the
primary pages (those listed in your urls.txt file) are fetched and nothing
deeper is crawled.
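
For the filtering route, the regexps do not have to be written by hand. The sketch below derives filter lines from the seed list, so only plain URLs need maintaining. It assumes the usual regex URL filter layout of conf/crawl-urlfilter.txt (a leading "+" accepts, a leading "-" rejects, and the first matching line wins); the function name filter_rules is made up for illustration. Note that it restricts the crawl to the hosts of the seed URLs, not to the exact pages.

```python
from urllib.parse import urlsplit


def filter_rules(seed_urls):
    """Build URL filter lines that accept only the hosts of the seed URLs.

    Assumes the regex filter format: "+regex" accepts, "-regex" rejects.
    """
    hosts = []
    for url in seed_urls:
        url = url.strip()
        if not url or url.startswith("#"):
            continue  # skip blank lines and comments in the seed file
        host = urlsplit(url).hostname
        if host and host not in hosts:
            hosts.append(host)
    # Escape dots so they match literally inside the regex.
    rules = ["+^https?://" + h.replace(".", "\\.") + "/" for h in hosts]
    rules.append("-.")  # reject everything that no earlier line accepted
    return rules


if __name__ == "__main__":
    # Reads the seed file named in the thread and prints the filter lines.
    with open("urls/urls.txt") as seeds:
        print("\n".join(filter_rules(seeds)))
```

The printed lines would go into the filter file in place of the default accept rule; check them against the comments in your own copy of conf/crawl-urlfilter.txt before relying on them.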



Pierre-Luc Bacon wrote:
> 
> I would like Nutch to crawl the URLs listed in a file (say,
> urls/urls.txt) and stay strictly within them. I have been using
> Nutch for a few weeks now, but it bothers me that the crawler
> follows the ads on websites and indexes their content. Most of the
> time, the crawler ends up analysing content from "free ipod,
> discount stuff and traveltoBananaIsland.com" style sites, which I
> have no interest in having in the index.
> 
> I know that conf/crawl-urlfilter.txt could be used for that purpose, but
> I was wondering whether a single line in a conf file could turn such a
> feature on. I would prefer to avoid writing regexps and just feed the
> crawler plain URLs.
> 
> 


