On 03.12.2011 08:32, mina wrote:
> Hi, I want to give Nutch several sites to crawl. For example, I
> want Nutch to crawl:
> http://www.site1.com
> http://www.site2.com
> http://www.site3.com
> How can I do that? Please help me.
1.) Create a directory called e.g. "seedUrls" and add a plain text file
listing all the sites you want to crawl, one URL per line.
2.) Add:
+^http://www.site1.com
+^http://www.site2.com
...
+^http://www.siteN.com
to your regex-urlfilter.txt so that these URLs are allowed to be crawled.
3.) Call the inject command (./nutch inject <crawldb> <url_dir>), where
<crawldb> is the name of your new crawldb and <url_dir> is the directory
containing the seed URLs, "seedUrls" in my example.
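Steps 1 and 2 could be scripted roughly like this. It is only a sketch:
the file name seed.txt and the SEED_DIR variable are my own choices, not
something Nutch requires, and the three URLs are the ones from the
question. The script writes the seed file and prints one accept rule per
URL for pasting into regex-urlfilter.txt.

```shell
#!/bin/sh
# Sketch of steps 1 and 2: build the seed list and matching filter rules.
# Assumption: seed.txt and SEED_DIR are illustrative names.
SEED_DIR="${SEED_DIR:-seedUrls}"
mkdir -p "$SEED_DIR"

# Seed file: one URL per line.
cat > "$SEED_DIR/seed.txt" <<'EOF'
http://www.site1.com
http://www.site2.com
http://www.site3.com
EOF

# Turn each seed URL into an accept rule for regex-urlfilter.txt:
# escape the dots so they match literally, then prefix the line with "+^".
FILTER_RULES=$(sed -e 's/\./\\./g' -e 's/^/+^/' "$SEED_DIR/seed.txt")
printf '%s\n' "$FILTER_RULES"
```

Escaping the dots is optional; the unescaped form from step 2 also
matches, it is just slightly more permissive because "." matches any
character in a regex.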
Then you can run the generator, fetcher, parser and updater for one
crawl cycle.
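One such cycle might look like the following, assuming the Nutch 1.x
command-line tools (generate, fetch, parse, updatedb). The paths
crawl/crawldb and crawl/segments and the crawl_cycle function name are
only illustrative; adjust NUTCH to wherever your nutch script lives.

```shell
#!/bin/sh
# Sketch of one crawl cycle with the Nutch 1.x command-line tools.
# Assumptions: NUTCH, CRAWLDB and SEGMENTS are illustrative defaults.
NUTCH="${NUTCH:-./nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"

crawl_cycle() {
    # Generator: select URLs from the crawldb into a new segment.
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" || return 1
    # Segments are named by timestamp, so the newest one sorts last.
    segment=$(ls -d "$SEGMENTS"/2* | tail -n 1)
    # Fetcher: download the pages listed in the segment.
    "$NUTCH" fetch "$segment" || return 1
    # Parser: extract text and outlinks from the fetched pages.
    "$NUTCH" parse "$segment" || return 1
    # Updater: merge the results back into the crawldb.
    "$NUTCH" updatedb "$CRAWLDB" "$segment" || return 1
}
```

After injecting the seeds (./nutch inject crawl/crawldb seedUrls), call
crawl_cycle once per round; each further round picks up the links
discovered in the previous one.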
Hope that helps for a start. :)