nutch-user mailing list archives

From Zhou LiBing <zhoulib...@gmail.com>
Subject Re: [Nutch-general] RE : Crawl some sites
Date Thu, 12 May 2005 00:49:19 GMT
If I want to crawl the whole WWW but don't want to use the DMOZ data, what
should I do?
 

 On 5/12/05, Jean-Luc <jean-luc@eserver.hopto.org> wrote: 
> 
> Use this command line to inject URLs into your existing db:
> nutch inject db -urlfile sites.txt
> 
> Works for me :)
> 
> -----Original Message-----
> From: Ian Reardon [mailto:irnutch@gmail.com]
> Sent: Wednesday, May 11, 2005 00:02
> To: nutch-user@incubator.apache.org
> Subject: Crawl some sites
> 
> I would like to crawl some specific sites with Nutch for content. I
> will be manually looking for sites all the time and would like to add
> them to my index on a regular basis. Say I look around for sites to
> crawl and add 1 or 2 a week. Can anyone walk me through this in
> pseudo-steps?
> 
> I crawled some sites with Nutch by creating a flat file of URLs and
> then running the crawl command. It created the directories and DBs, but
> when I tried to add a new site after the crawl I got an error saying the
> directory or DB already exists. Do I have to recrawl all my content every
> time I add something? That is, delete the folder, add the new site to my
> flat file, and crawl everything all over again? Thanks.
> 
> 
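For the quoted workflow of adding one or two sites a week, here is a rough sketch of how the inject command from the thread might be wrapped so that only URLs not already in the flat file get injected. The function names (new_urls, add_sites), the NUTCH override variable, and the file names are my own assumptions for illustration; only "nutch inject db -urlfile <file>" comes from the thread.

```sh
#!/bin/sh
# Sketch only: new_urls, add_sites and the NUTCH variable are hypothetical
# helpers, not part of Nutch. The inject command itself is quoted from the thread.

# Print the non-blank lines of the candidates file ($2) that are not already
# in the master list ($1), keeping their order and dropping duplicates.
new_urls() {
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next }
         $0 != "" && !seen[$0]++' "$1" "$2"
}

# Inject only the new URLs into the existing db, then record them in the
# master list so that they are skipped on the next run.
add_sites() {
    master=$1
    candidates=$2
    batch=$(mktemp) || return 1
    touch "$master"                      # make sure the master list exists
    new_urls "$master" "$candidates" > "$batch"
    if [ -s "$batch" ]; then
        # Set NUTCH=echo for a dry run that only prints the command.
        ${NUTCH:-nutch} inject db -urlfile "$batch" && cat "$batch" >> "$master"
    fi
    status=$?
    rm -f "$batch"
    return $status
}
```

After sourcing the file, it could be called as "add_sites sites.txt this_week.txt", where this_week.txt is a hypothetical file holding the newly found URLs. This covers only the inject step; the thread does not describe how the injected URLs are then fetched and indexed.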



-- 
---Letter From your friend Blue at HUST CGCL---