nutch-user mailing list archives

From ML mail <>
Subject Re: topN question
Date Tue, 18 Nov 2008 21:23:01 GMT
> It is because some urls are not fetched, either because they
> are bad, the site is currently down, etc.  The 3000 number
> is the number of urls that you start with.  The number you
> end up with is the number of urls that were successfully
> fetched.  Depending on the quality of the url list
> generated, I have seen bad url rates from 0% to 80%.  Usually
> though it is about 10%, give or take.

Last night I had the Nutch fetcher running with topN set to 100'000 and the index grew by 54'418
documents, so this time it wasn't as bad. Maybe on a small sample (3000 like the other day)
the bad urls are simply more noticeable, or it was just bad luck.

> Yikes.  Yes, you are dropping a lot of documents.  I would
> look through your fetching logs and see if you have a lot of
> timeout errors.  Maybe you are maxing out your crawl
> bandwidth?  Maybe you just have a bad list of urls, but if
> this is a random list, that seems really high.

Nearly no timeouts, just a few. Crawling currently uses only 2 Mbit/s of bandwidth because
I limited the threads to 50, so there is plenty of bandwidth left ;-) My source for the initial
crawl import was an extract of DMOZ's URLs for one single top-level domain (around 100'000
urls). I guess DMOZ shouldn't be a bad source.
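For reference, the thread limit above is controlled in nutch-site.xml via the fetcher.threads.fetch property; a sketch of the override I am using (the value 50 matches what I described, but adjust it to your own bandwidth):

```xml
<!-- nutch-site.xml: limit the number of concurrent fetcher threads.
     50 is the value mentioned above; raise it if you have spare bandwidth. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>50</value>
  <description>Number of FetcherThreads the fetcher should use.</description>
</property>
```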

> Redirects, in a word.  The http.redirect.max conf variable
> is by default set to 0, which means that any url that sends
> a redirect will be fetched on the next crawl cycle and not
> immediately.  This alone may account for some of your crawl
> numbers.  Try setting that value to 3 in nutch-site.xml and
> see if your numbers don't improve.

Thanks for the tip; I have changed it to 3 and will see how it performs tonight.
Best regards
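For anyone following along, this is the override I added to nutch-site.xml, based on the suggestion above (3 is the suggested value; it caps how many redirect hops are followed within the same fetch):

```xml
<!-- nutch-site.xml: follow up to 3 redirects immediately instead of
     deferring redirected urls to the next crawl cycle (default is 0). -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>The maximum number of redirects the fetcher will follow
  when trying to fetch a page.</description>
</property>
```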
