nutch-user mailing list archives

From Marek Bachmann <>
Subject Re: how give several sites to nutch to crawl?
Date Sat, 03 Dec 2011 15:52:49 GMT
On 03.12.2011 08:32, mina wrote:
> Hi, I want to give Nutch several sites and have Nutch crawl them. For example, I
> want Nutch to crawl:
> How can I do that? Help me.

1.) Make a directory called e.g. "seedUrls" and add a plain text file to it
that lists all the sites you want to crawl, one URL per line
2.) Add:
to your regex-urlfilter.txt so that these URLs are allowed to be crawled

3.) Call the inject command (./nutch inject <crawldb> <url_dir>), where
<crawldb> is the name of your new crawldb and <url_dir> is the directory
containing the seed URLs, in my example "seedUrls"
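
Steps 1 and 3 could look roughly like the sketch below. The file name "seeds.txt", the example.com/example.org URLs and the crawldb path "crawl/crawldb" are placeholders I made up for illustration; use your own sites and paths. Step 2 is not shown, because it is a manual edit of regex-urlfilter.txt.

```shell
#!/bin/sh
# Step 1: create the seed directory and a plain text seed file.
# The URLs below are placeholders; list your own sites, one per line.
mkdir -p seedUrls
cat > seedUrls/seeds.txt <<'EOF'
http://www.example.com/
http://www.example.org/
EOF

# Step 3: inject the seed URLs into a new crawldb.
# "crawl/crawldb" is an assumed location; any path works.
# The call is guarded so the script only runs nutch when the
# launcher script is present in the current directory.
if [ -x ./nutch ]; then
  ./nutch inject crawl/crawldb seedUrls
else
  echo "nutch launcher not found in current directory; skipping inject" >&2
fi
```

Run it from the directory that contains the nutch launcher script (usually the bin directory of your Nutch installation), or adjust the path to the launcher.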

Then you can call the generator, fetcher, parser and updater to run a crawl
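
One round of generate, fetch, parse and update might be sketched as follows for Nutch 1.x. The paths "crawl/crawldb" and "crawl/segments" and the function name crawl_round are my assumptions, not something Nutch requires; the newest segment is picked by sorting the timestamped directory names.

```shell
#!/bin/sh
# Assumed locations; adjust to your setup.
NUTCH="${NUTCH:-./nutch}"
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"

# Runs one crawl round: generate -> fetch -> parse -> updatedb.
crawl_round() {
  # Generate a fetch list; this creates a new timestamped segment directory.
  "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" || return 1

  # Pick the most recent segment (names are timestamps, so the last one sorts highest).
  segment=$(ls -d "$SEGMENTS"/* | tail -1)

  # Fetch the pages, parse them, then write the results back to the crawldb.
  "$NUTCH" fetch "$segment" || return 1
  "$NUTCH" parse "$segment" || return 1
  "$NUTCH" updatedb "$CRAWLDB" "$segment"
}

# Only run when the launcher is actually available.
if [ -x "$NUTCH" ]; then
  crawl_round
fi
```

Each call to crawl_round goes one link level deeper from the seed URLs, so you would typically repeat it a few times.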

Hope that helps to get you started. :)
