nutch-user mailing list archives

From Robert Scavilla <rscavi...@gmail.com>
Subject Re: finding broken links with nutch 1.14
Date Mon, 02 Mar 2020 22:11:14 GMT
Nutch 1.14:
I am looking at the FetcherThread code. The 404 URL does get flagged with
ProtocolStatus.NOTFOUND, but the broken link never gets into the crawldb.
It does, however, get into the linkdb. Please tell me how I can collect these
404 URLs.
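
[Editor's sketch, not part of the original thread.] Once gone entries do land in the crawldb, `bin/nutch readdb <crawldb> -dump <dir>` writes them in a plain-text format: a header line of the form `<url>\tVersion: 7` followed by lines such as `Status: 3 (db_gone)`. Assuming that default format (the sample text below is illustrative, not from a real crawl), a small script can collect the gone URLs:

```python
def gone_urls(dump_text):
    """Return URLs from a readdb text dump whose Status line names db_gone."""
    gone = []
    current_url = None
    for line in dump_text.splitlines():
        if line.startswith(("http://", "https://")):
            # Entry header looks like: "<url>\tVersion: 7"
            current_url = line.split("\t", 1)[0]
        elif line.startswith("Status:") and "db_gone" in line and current_url:
            gone.append(current_url)
    return gone

# Illustrative dump excerpt (two entries, one fetched, one gone):
sample = """http://example.com/ok\tVersion: 7
Status: 2 (db_fetched)
Fetch time: Mon Mar 02 12:00:00 UTC 2020
http://example.com/missing\tVersion: 7
Status: 3 (db_gone)
Fetch time: Mon Mar 02 12:00:01 UTC 2020
"""

print(gone_urls(sample))  # -> ['http://example.com/missing']
```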

Any help would be appreciated,
...bob

            case ProtocolStatus.NOTFOUND:
            case ProtocolStatus.GONE: // gone
            case ProtocolStatus.ACCESS_DENIED:
            case ProtocolStatus.ROBOTS_DENIED:
              output(fit.url, fit.datum, null, status,
                  CrawlDatum.STATUS_FETCH_GONE); // broken link is getting here
              break;
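
[Editor's sketch, not part of the original thread.] The code above only writes STATUS_FETCH_GONE into the segment; it is the `updatedb` step that folds segment fetch statuses into the crawldb, where a fetch_gone becomes db_gone. A sketch of the cycle follows; directory names are illustrative, and the `-status` filter on `readdb` is assumed to be available in your build:

```shell
# Illustrative crawl cycle; adjust paths to your layout.
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=$(ls -d crawl/segments/* | tail -1)   # newest segment
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
# updatedb folds STATUS_FETCH_GONE into the crawldb as db_gone;
# without this step the 404s never reach the crawldb.
bin/nutch updatedb crawl/crawldb "$SEGMENT"
# Dump only the gone entries, assuming -status is supported:
bin/nutch readdb crawl/crawldb -dump gone_dump -status db_gone
```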

On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <rscavilla@gmail.com>
wrote:

> Hi again, and thank you in advance for your kind help.
>
> I'm using Nutch 1.14
>
> I'm trying to use nutch to find broken links (404s) on a site. I
> followed the instructions:
> bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
>
> but the dump only shows 200 and 301 statuses. There is no sign of any
> broken link. When I enter just one broken link in the seed file, the
> crawldb is empty.
>
> Please advise how I can inspect broken links with Nutch 1.14.
>
> Thank you!
> ...bob
>
