nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Best strategy for boundary defined crawling
Date Fri, 02 Dec 2011 15:54:56 GMT
I'm not sure what you mean; do you want to crawl a set of domains but never 
add new domains to the DB? You can set db.ignore.external.links to true; this 
is an easy way to restrict the DB to one or more specific domains.
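
A minimal sketch of that setting in conf/nutch-site.xml, assuming Nutch 1.x 
(the property name is from nutch-default.xml; the description text here is 
illustrative):

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>Do not follow outlinks that point to a different host,
    so the crawl never leaves the injected domains.</description>
  </property>

With this set, the generate/fetch/updatedb cycle only adds URLs from hosts 
that are already in the CrawlDB.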

On Friday 02 December 2011 16:23:53 contacts@complexityintelligence.com wrote:
> Hello,
> 
> 
>    We want to crawl the DMOZ set of web sites, and only this set. What is
> the best strategy to use with Nutch?
> 
> 
>    I'm new to Nutch, and I'm comparing it with our in-house crawling
> solution; we may switch to Nutch if this test goes well.
> 
> 
>    I think that the trivial solution is something like:
> 
> 
>      - Parse the DMOZ content file and extract the seed URLs (all URLs;
> see the sketch after this list)
>      - Use a regex URL filter, adding one entry for each URL in the seed
> file
>      - I hope that some option exists to limit the crawl to the space of
> each domain, and of course to skip outbound links (to different domains).
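
For the first two steps, Nutch ships a DMOZ RDF parser you can run from the 
command line; a minimal sketch, assuming Nutch 1.x (the paths are examples, 
not fixed names):

  # extract the URLs from the DMOZ content dump into a seed directory
  bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls

  # inject the seeds into a fresh CrawlDB
  bin/nutch inject crawl/crawldb dmoz/

The third step is exactly what db.ignore.external.links covers: with it set 
to true you do not need a regex entry per domain at all.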
> 
> 
> 
>    I think that a file-based regex URL filter is not a good solution. If I
> have a database, even an embedded Java one like HSQLDB or H2, holding all
> the regex URL filter entries, can I use a DB instead of a file? Writing a
> plug-in is not a problem, if needed.
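
Writing such a plug-in means implementing the URLFilter extension point. A 
hypothetical sketch, assuming Nutch 1.x and an H2 table HOSTS(NAME) that you 
fill from the DMOZ dump (the class, the table, and the JDBC URL are 
illustrative, not part of Nutch):

  package org.example.nutch;

  import java.net.URL;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class DbUrlFilter implements URLFilter {

    private Configuration conf;
    private Connection db;

    // Nutch calls filter() for every candidate URL; return the URL to
    // keep it, or null to drop it.
    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost();
        PreparedStatement ps =
            db.prepareStatement("SELECT 1 FROM HOSTS WHERE NAME = ?");
        try {
          ps.setString(1, host);
          ResultSet rs = ps.executeQuery();
          return rs.next() ? urlString : null;
        } finally {
          ps.close();
        }
      } catch (Exception e) {
        return null; // malformed URL or DB error: reject
      }
    }

    public void setConf(Configuration conf) {
      this.conf = conf;
      try {
        // illustrative embedded H2 database; a real plug-in would read
        // the JDBC URL from conf
        db = DriverManager.getConnection("jdbc:h2:./dmoz-hosts");
      } catch (Exception e) {
        throw new RuntimeException("Cannot open host database", e);
      }
    }

    public Configuration getConf() {
      return conf;
    }
  }

You would also register the class in the plug-in's plugin.xml and add the 
plug-in id to plugin.includes, like any other URL filter. That said, for a 
plain host whitelist the db.ignore.external.links route above is much 
simpler than a custom filter.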
> 
> 
> 
> Thanks,
> Alessio

-- 
Markus Jelsma - CTO - Openindex
