nutch-user mailing list archives

From Dennis Kubes <ku...@apache.org>
Subject Re: Fetching only unfetched URLs
Date Thu, 04 Dec 2008 13:40:43 GMT


Otis Gospodnetic wrote:
> Hi,
> 
> Is there an existing method for generating a segment/fetchlist containing only URLs that
> have not yet been fetched?
> I'm asking because I can imagine a situation where one has a large and "old" CrawlDb
> that "knows" about a lot of URLs (the ones with "db_unfetched" status if you run -stats). In
> such a situation a person may prefer to fetch only the yet-unfetched URLs first, and only
> after that include URLs that need to be refetched in the newly generated segments.
> 

I don't think there is currently a way to generate only unfetched URLs,
but it does sound like an interesting bit of functionality.

> One can write a custom Generator, or perhaps modify the existing one to add this option,
> but is there an existing mechanism for this?

Modifying the Generator would probably be best. Let me look into what it
would take to do this. Maybe we can get it into 1.0.
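
To make the idea concrete, here is a rough, self-contained sketch of the
selection logic. It is not the actual Generator code: the class name, the
Status enum, the selectForFetch method, and the unfetchedOnly flag are all
hypothetical stand-ins, and the CrawlDb is modeled as a plain in-memory map
rather than real Nutch data structures.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Nutch.
final class UnfetchedOnlySketch {

    // Simplified stand-in for the per-URL status kept in a CrawlDb.
    enum Status { DB_UNFETCHED, DB_FETCHED }

    // Builds a fetchlist of at most topN URLs, in map iteration order.
    // When unfetchedOnly is true, URLs that were already fetched are skipped.
    static List<String> selectForFetch(Map<String, Status> crawlDb,
                                       boolean unfetchedOnly,
                                       int topN) {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, Status> entry : crawlDb.entrySet()) {
            if (fetchList.size() >= topN) {
                break; // fetchlist is full
            }
            if (unfetchedOnly && entry.getValue() != Status.DB_UNFETCHED) {
                continue; // already fetched; leave it for a later segment
            }
            fetchList.add(entry.getKey());
        }
        return fetchList;
    }

    public static void main(String[] args) {
        // Made-up example URLs and statuses.
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://a.example/", Status.DB_FETCHED);
        crawlDb.put("http://b.example/", Status.DB_UNFETCHED);
        crawlDb.put("http://c.example/", Status.DB_UNFETCHED);

        System.out.println(selectForFetch(crawlDb, true, 10));
    }
}
```

A real implementation would also have to respect whatever else the Generator
does when it selects URLs; the sketch only shows the extra status check.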

Dennis

> 
> If not, does this sound like something that should be added to the existing Generator
> and invoked via a command-line arg, say -unfetchedOnly ?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
