nutch-user mailing list archives

From "Jean-Luc" <jean-...@eserver.hopto.org>
Subject RE: Crawl some sites
Date Wed, 11 May 2005 19:43:41 GMT
Use this command line to inject URLs into your existing db:
nutch inject db -urlfile sites.txt

Works for me :)
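
For completeness, a rough sketch of the follow-up steps after injecting the new URLs, based on the Nutch 0.7-era whole-web tutorial (command names and the segments layout are assumptions and may differ in your version):

nutch generate db segments
s=`ls -d segments/2* | tail -1`   # pick the newly generated segment (path is an assumption)
nutch fetch $s
nutch updatedb db $s
nutch index $s

That fetches only the newly generated segment and folds the results back into the existing db and index, so there is no need to recrawl everything from scratch.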




-----Original Message-----
From: Ian Reardon [mailto:irnutch@gmail.com]
Sent: Wednesday, May 11, 2005 00:02
To: nutch-user@incubator.apache.org
Subject: Crawl some sites

I would like to crawl some specific sites with Nutch for content. I
will be physically looking for new sites all the time and would like to
add them to my index on a regular basis, say 1 or 2 new sites a week.
Can anyone walk me through this?

I crawled some sites with Nutch by creating a flat file of URLs and
then running the crawl command; it created the directories/DBs. But
when I tried to add a new site after the crawl, I got an error that the
directory or DB already exists. Do I have to recrawl all my content
every time I add something? That is, delete the folder, add the new
site to my flat file, and crawl them all over again? Thanks.



