nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: purging 404 URLs with SolrClean
Date Thu, 14 Jul 2011 21:29:44 GMT
Yes, it will send the same delete commands on every run, which is indeed a
shortcoming. You can open an issue and perhaps attach a patch. The problem is
that, as far as I know, there is no status in CrawlDatum for this situation.

Maybe it would be a better idea to add a -since parameter to only delete items 
after a specific timestamp.
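To illustrate the idea, here is a minimal sketch in plain Java. It is not Nutch code: the class SinceFilter, the method urlsToDelete and the URL-to-timestamp map are all hypothetical, and it assumes a timestamp (epoch milliseconds) is available for when each URL was seen as gone. The comparison is inclusive, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Nutch: illustrates the proposed -since filter.
public class SinceFilter {

    /**
     * Returns the URLs whose "gone" status was recorded at or after sinceMillis.
     * goneUrls maps each URL to the time (epoch millis) at which it was seen as gone.
     */
    public static List<String> urlsToDelete(Map<String, Long> goneUrls, long sinceMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : goneUrls.entrySet()) {
            if (entry.getValue() >= sinceMillis) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> gone = new LinkedHashMap<>();
        gone.put("http://example.com/old", 1_000L);
        gone.put("http://example.com/new", 5_000L);
        // Only the URL recorded at or after the cutoff is selected.
        System.out.println(urlsToDelete(gone, 3_000L));
    }
}
```

With a cutoff set to the time of the previous run, only URLs that went missing since then would be sent for deletion again.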

Anyway, it's not really a big deal unless you delete many thousands of items
very frequently and issue a commit after each run. The commit in particular is
costly.
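One way to keep that cost down is to send the deletes in batches and commit only once at the end. The sketch below assumes a hypothetical IndexClient interface as a stand-in for the index; it is not the SolrJ API, and BatchedDeleter and deleteAll are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: batch the deletes and issue a single commit at the end.
public class BatchedDeleter {

    /** Hypothetical stand-in for an index client; not the SolrJ API. */
    public interface IndexClient {
        void deleteByIds(List<String> ids);
        void commit();
    }

    /**
     * Sends ids to the client in batches of at most batchSize and commits once
     * after the last batch. Returns the number of batches sent.
     */
    public static int deleteAll(IndexClient client, List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the sublist so the client never holds a view of the caller's list.
            client.deleteByIds(new ArrayList<>(ids.subList(start, end)));
            batches++;
        }
        if (batches > 0) {
            client.commit(); // one commit, no matter how many batches were sent
        }
        return batches;
    }
}
```

Nothing is committed when there is nothing to delete, so a run that finds no gone URLs costs no commit at all.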

Cheers

> I've noticed that SolrClean does not mark URLs as purged from Solr. Will
> running the SolrClean task multiple times send the same URLs to Solr for
> deletion? If so, what is the best strategy to mark these documents in the
> crawl DB so they are not repeatedly deleted from Solr?
> 
> Blessings,
> TwP
