nutch-user mailing list archives

From Robert Scavilla <rscavi...@gmail.com>
Subject Re: finding broken links with nutch 1.14
Date Tue, 03 Mar 2020 18:56:51 GMT
Sebastian, I'm so sorry to have bothered you. Following your email, I found the setting that was purging the 404 pages. It was set to true; once set to false, everything worked!

Thank you,
...bob

        <property>
                <name>db.update.purge.404</name>
                <value>false</value>
                <description>If true, updatedb will add purge records with
status DB_GONE from the CrawlDB.</description>
        </property>
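For anyone hitting the same issue: with purging disabled, updatedb keeps the DB_GONE entries, and `bin/nutch readdb <crawlFolder>/crawldb -dump myDump` will include them with a status line like `Status: 3 (db_gone)`. A minimal sketch (the dump layout and file contents below are assumptions for illustration, not verbatim Nutch output) of collecting the gone URLs from such a plain-text dump:

```python
import re

def gone_urls(dump_text):
    """Collect URLs whose CrawlDb record carries status db_gone.

    Assumes the plain-text `readdb -dump` layout: each record starts
    with the URL on its own line (URL, tab, "Version: ..."), followed
    by "Status: ..." and other key/value lines.
    """
    gone = []
    current_url = None
    for line in dump_text.splitlines():
        # Record header line: "<url>\tVersion: 7"
        m = re.match(r"^(\S+)\tVersion:", line)
        if m:
            current_url = m.group(1)
        elif current_url and line.startswith("Status:") and "db_gone" in line:
            gone.append(current_url)
    return gone

# Toy dump with one healthy and one gone record (fabricated sample data):
sample = """\
http://example.com/\tVersion: 7
Status: 2 (db_fetched)
Fetch time: Tue Mar 03 18:00:00 UTC 2020
http://example.com/missing\tVersion: 7
Status: 3 (db_gone)
Fetch time: Tue Mar 03 18:00:01 UTC 2020
"""

print(gone_urls(sample))  # the broken link(s) only
```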

On Tue, Mar 3, 2020 at 3:57 AM Sebastian Nagel
<wastl.nagel@googlemail.com.invalid> wrote:

> Hi Robert,
>
> 404s are recorded in the CrawlDb after the tool "updatedb" is called.
> Could you share the commands you're running? Please also have a look into
> the log files (esp. the
> hadoop.log) - all fetches are logged and
> also whether fetches have failed. If you cannot find a log message
> for the broken links, it might be that the URLs are filtered. In this
> case, please also share the configuration (if different from the default).
>
> Best,
> Sebastian
>
> On 3/2/20 11:11 PM, Robert Scavilla wrote:
> > Nutch 1.14:
> > I am looking at the FetcherThread code. The 404 url does get flagged with
> > a ProtocolStatus.NOTFOUND, but the broken link never gets to the crawldb.
> > It does, however, get into the linkdb. Please tell me how I can collect
> > these 404 URLs.
> >
> > Any help would be appreciated,
> > ...bob
> >
> >             case ProtocolStatus.NOTFOUND:
> >             case ProtocolStatus.GONE: // gone
> >             case ProtocolStatus.ACCESS_DENIED:
> >             case ProtocolStatus.ROBOTS_DENIED:
> >               output(fit.url, fit.datum, null, status,
> >                   CrawlDatum.STATUS_FETCH_GONE); // broken link gets here
> >               break;
> >
> > On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <rscavilla@gmail.com>
> > wrote:
> >
> >> Hi again, and thank you in advance for your kind help.
> >>
> >> I'm using Nutch 1.14
> >>
> >> I'm trying to use nutch to find broken links (404s) on a site. I
> >> followed the instructions:
> >> bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
> >>
> >> but the dump only shows 200 and 301 status. There is no sign of any
> >> broken link. When I enter just one broken link in the seed file, the
> >> crawldb is empty.
> >>
> >> Please advise how I can inspect broken links with Nutch 1.14.
> >>
> >> Thank you!
> >> ...bob
> >>
> >
>
>
