Nutch 1.14:
I am looking at the FetcherThread code. The 404 URL does get flagged with
a ProtocolStatus.NOTFOUND, but the broken link never gets into the crawldb.
It does, however, get into the linkdb. Please tell me how I can collect
these 404 URLs.
Any help would be appreciated,
...bob
case ProtocolStatus.NOTFOUND:
case ProtocolStatus.GONE: // gone
case ProtocolStatus.ACCESS_DENIED:
case ProtocolStatus.ROBOTS_DENIED:
  // the broken link reaches this branch
  output(fit.url, fit.datum, null, status,
      CrawlDatum.STATUS_FETCH_GONE);
  break;
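
For reference, a rough sketch of how one might surface that status,
assuming the usual crawl/crawldb and crawl/segments layout (the
<segment> path below is a placeholder): the STATUS_FETCH_GONE written
above lands in the segment's crawl_fetch, and only reaches the crawldb
as db_gone after an updatedb pass.

# inspect the fetch statuses recorded in the segment's crawl_fetch
bin/nutch readseg -dump crawl/segments/<segment> seg_dump
# fold the fetch statuses from that segment back into the crawldb
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
# dump only db_gone entries, if your readdb build supports the -status filter
bin/nutch readdb crawl/crawldb -dump gone_dump -status db_gone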
On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <rscavilla@gmail.com>
wrote:
> Hi again, and thank you in advance for your kind help.
>
> I'm using Nutch 1.14
>
> I'm trying to use Nutch to find broken links (404s) on a site. I
> followed the instructions:
> bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
>
> but the dump only shows 200 and 301 statuses. There is no sign of any
> broken link. When I enter just one broken link in the seed file, the
> crawldb is empty.
>
> Please advise how I can inspect broken links with Nutch 1.14.
>
> Thank you!
> ...bob
>