nutch-user mailing list archives

From Marek Bachmann <m.bachm...@uni-kassel.de>
Subject Re: how give several sites to nutch to crawl?
Date Sat, 03 Dec 2011 15:52:49 GMT
On 03.12.2011 08:32, mina wrote:
> Hi, I want to give Nutch several sites to crawl. For example, I
> want Nutch to crawl:
> http://www.site1.com
> http://www.site2.com
> http://www.site3.com
> How can I do that? Please help.

1.) Make a directory called e.g. "seedUrls" and add a plain text file to
it listing all the sites you want to crawl, one URL per line.
2.) Add:
	+^http://www.site1.com
	+^http://www.site2.com
	...
	+^http://www.siteN.com
to your regex-urlfilter.txt so that these URLs are allowed to be crawled.

3.) Call the inject command (./nutch inject <crawldb> <url_dir>), where
<crawldb> is the name of your new crawldb and <url_dir> is the directory
containing the seed URLs, in my example "seedUrls".
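
The steps above can be sketched as a small shell script. This is only an
illustration under a few assumptions: the file names "seed.txt" and
"urlfilter-additions.txt" are made up for the example, and the generated
rules still have to be pasted into regex-urlfilter.txt by hand.

```shell
#!/bin/sh
set -eu

SEED_DIR="seedUrls"             # directory name from step 1
SEED_FILE="$SEED_DIR/seed.txt"  # hypothetical file name, any name should do

mkdir -p "$SEED_DIR"

# Step 1: a plain text seed file, one URL per line.
cat > "$SEED_FILE" <<'EOF'
http://www.site1.com
http://www.site2.com
http://www.site3.com
EOF

# Step 2: derive one "+^<url>" accept rule per seed URL.
# The result goes into a scratch file (hypothetical name); copy its lines
# into regex-urlfilter.txt, above any rule that would reject these URLs.
sed 's/^/+^/' "$SEED_FILE" > urlfilter-additions.txt
cat urlfilter-additions.txt

# Step 3: run from the directory that contains the nutch script.
# "crawldb" is just an example name for the new crawldb.
# ./nutch inject crawldb "$SEED_DIR"
```

The inject call is left commented out so the script can be tried without
a Nutch installation; uncomment it once the filter rules are in place.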

Then you can run the generator, fetcher, parser and updater (the generate,
fetch, parse and updatedb commands) for one crawl cycle.
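
One crawl cycle might look roughly like the function below. It is a
sketch, not a definitive script: the function name crawl_cycle and the
NUTCH variable are made up for the example, and it assumes a Nutch 1.x
setup on the local filesystem where generate creates a timestamp-named
directory under the segments directory.

```shell
#!/bin/sh
# Path to the nutch script; an assumption, adjust to your installation.
NUTCH="${NUTCH:-./nutch}"

# crawl_cycle <crawldb> <segments_dir>  (hypothetical helper)
crawl_cycle() {
    crawldb="$1"
    segments="$2"

    # Generator: pick the URLs that are due and create a new segment.
    "$NUTCH" generate "$crawldb" "$segments" || return 1

    # Segment directories are named by timestamp, so the newest sorts last.
    segment=$(ls -d "$segments"/* | sort | tail -n 1)

    # Fetcher, parser, updater: each step works on that segment.
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" parse "$segment" || return 1
    "$NUTCH" updatedb "$crawldb" "$segment" || return 1
}
```

After the inject step, calling e.g. crawl_cycle crawldb segments runs a
single round; repeating the call goes one link level deeper each time.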

Hope that helps to get you started. :)
