manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Hop count problem
Date Mon, 12 Aug 2013 11:31:30 GMT
Hi Erlend,

If any link in the chain from the seed to the document is broken, a
document reachable on a previous crawl can become unreachable and thus
report "Hop count exceeded".  In this case, the document must have been
queued somehow - or must have been present from a previous crawl.

So, for example, suppose you have this chain:

A->B->C

... and then all of a sudden, B cannot be fetched.  Then, C will report
that its hopcount is exceeded.
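
To illustrate the A->B->C case, here is a minimal Python sketch. It is not MCF's actual hopcount code; the function name hop_counts, the dictionary-based link graph, and the fetchable set are assumptions made up for this example. It computes hop counts breadth-first from the seed and never follows links out of a page that could not be fetched.

```python
from collections import deque

def hop_counts(links, seeds, fetchable, max_hops):
    """Return {page: hops from nearest seed} for every page that can be discovered.

    links     -- hypothetical link graph: page -> list of pages it links to
    fetchable -- pages whose content can actually be retrieved
    max_hops  -- hop limit; links are not followed from pages already at the limit
    """
    counts = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # A page that cannot be fetched contributes no outgoing links.
        if page not in fetchable or counts[page] >= max_hops:
            continue
        for target in links.get(page, []):
            if target not in counts:
                counts[target] = counts[page] + 1
                queue.append(target)
    return counts

links = {"A": ["B"], "B": ["C"]}

# Earlier crawl: every page can be fetched, so C is two hops from the seed.
before = hop_counts(links, ["A"], fetchable={"A", "B", "C"}, max_hops=6)

# Later crawl: B cannot be fetched, so no path to C is ever found.
after = hop_counts(links, ["A"], fetchable={"A", "C"}, max_hops=6)
```

In the second run C gets no hop count at all, even though C itself could be fetched. A document that is already in the queue from an earlier crawl but has no path from the seed is the kind of document that ends up reported as "Hop count exceeded".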

Based on your report that the test environment works OK, and the production
environment does not, I expect there is something like this going on.  I
know you attempted to fetch the intervening document from your test
environment, but it is conceivable that the production environment is
unable to get it.  You should see evidence of that in the simple history,
if so.

I can try a sample crawl from home tonight if you like, and we can see
whether I get the reduced set or the complete one.  However, bear in mind
that hopcount is one of MCF's most rigorously tested features, so I
personally doubt there is a problem with the hopcount logic per se.

Thanks,
Karl



On Mon, Aug 12, 2013 at 6:39 AM, Erlend Garåsen <e.f.garasen@usit.uio.no> wrote:

>
> I have discovered an odd thing regarding hop counts. Our prod environment
> crawls a lot fewer documents compared to our test environment even though
> the configuration is exactly the same. Then I figured out that several
> documents which are expected to be fetched are, according to MCF, outside
> the hop count limit, but they're not.
>
> This can be reproduced by using a small job for one particular host,
> www.ibsen.uio.no. The seed list is as follows:
>
> http://www.ibsen.uio.no/
>
> Hop filter settings are:
> link: 6
> redirect: 3
>
> Only these two documents are fetched:
> http://www.ibsen.uio.no/forside.xhtml
> http://www.ibsen.uio.no/
>
> Here's what MCF says about one omitted document, i.e.,
> http://www.ibsen.uio.no/skuespill.xhtml:
> State: out of scope
> Status: Hopcount exceeded
>
> This is odd. If you open up www.ibsen.uio.no, you can see that the link
> "http://www.ibsen.uio.no/skuespill.xhtml" (Skuespill) appears on the main
> page.
>
> Our test environment fetches this document without problems.
>
> Erlend
>
