manifoldcf-user mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Hop count problem
Date Mon, 12 Aug 2013 11:43:21 GMT

Thanks Karl,

Maybe some documents became unreachable while I was trying to reproduce 
some problems I had with this host some months ago. But the thing is 
that our test environment also crawls 50% more documents for other jobs. 
That might be due to unreachable documents as well.

What is the best approach to tell MCF that all documents should be 
processed again? Manually delete some tables from the database?

Erlend

On 8/12/13 1:31 PM, Karl Wright wrote:
> Hi Erlend,
>
> If any link in the chain from the seed to the document is broken, a
> document reachable on a previous crawl can become unreachable and thus
> report "Hop count exceeded".  In this case, the document must have been
> queued somehow - or must have been present from a previous crawl.
>
> So, for example, suppose you have this chain:
>
> A->B->C
>
> ... and then all of a sudden, B cannot be fetched.  Then, C will report
> that its hopcount is exceeded.
>
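
The A->B->C scenario can be illustrated with a small sketch. This is not MCF's actual hop count implementation; it is a simplified breadth-first model, and the names hop_counts, in_scope, and unfetchable are hypothetical, introduced only for this example. It assumes that a document which cannot be fetched contributes no outgoing links, so anything reachable only through it no longer has a valid hop count.

```python
from collections import deque


def hop_counts(links, seed, unfetchable=frozenset()):
    """Return the breadth-first hop distance from the seed to each document.

    Assumption: links out of a document that cannot be fetched are never
    discovered, so documents behind it get no hop count at all.
    """
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        doc = queue.popleft()
        if doc in unfetchable:
            continue  # the document is known, but its links cannot be read
        for target in links.get(doc, []):
            if target not in dist:
                dist[target] = dist[doc] + 1
                queue.append(target)
    return dist


def in_scope(doc, dist, max_hops):
    """A document is in scope only if it has a hop count within the limit."""
    return doc in dist and dist[doc] <= max_hops


links = {"A": ["B"], "B": ["C"]}

before = hop_counts(links, "A")                     # every link can be followed
after = hop_counts(links, "A", unfetchable={"B"})   # B can no longer be fetched

print(in_scope("C", before, max_hops=6))
print(in_scope("C", after, max_hops=6))
```

In this model C is within the limit on the first crawl, but once B cannot be fetched, C loses its only path from the seed and falls out of scope even though the link limit itself has not changed.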
> Based on your report that the test environment works OK, and the
> production environment does not, I expect there is something like this
> going on.  I know you attempted to fetch the intervening document from
> your test environment, but it is conceivable that the production
> environment is unable to get it.  You should see evidence of that in the
> simple history, if so.
>
> I can try a sample crawl from home tonight if you like, and we can see
> whether I get the reduced set or the complete one.  However, bear in
> mind that hopcount is one of MCF's most rigorously tested features, so I
> personally doubt there is a problem with the hopcount logic per se.
>
> Thanks,
> Karl
>
>
>
> On Mon, Aug 12, 2013 at 6:39 AM, Erlend Garåsen <e.f.garasen@usit.uio.no
> <mailto:e.f.garasen@usit.uio.no>> wrote:
>
>
>     I have discovered an odd thing regarding hop counts. Our prod
>     environment crawls a lot fewer documents compared to our test
>     environment even though the configuration is exactly the same. Then
>     I figured out that several documents which are expected to be
>     fetched are, according to MCF, outside the hop count limit, but
>     they're not.
>
>     This can be reproduced by using a small job for one particular host,
>     www.ibsen.uio.no <http://www.ibsen.uio.no>. The seed list is as follows:
>
>     http://www.ibsen.uio.no/
>
>     Hop filter settings are:
>     link: 6
>     redirect: 3
>
>     Only these two documents are fetched:
>     http://www.ibsen.uio.no/__forside.xhtml
>     <http://www.ibsen.uio.no/forside.xhtml>
>     http://www.ibsen.uio.no/
>
>     Here's what MCF says about one omitted document, i.e.,
>     http://www.ibsen.uio.no/__skuespill.xhtml
>     <http://www.ibsen.uio.no/skuespill.xhtml>:
>     State: out of scope
>     Status: Hopcount exceeded
>
>     This is odd. If you open up www.ibsen.uio.no
>     <http://www.ibsen.uio.no>, you can see that the link
>     "http://www.ibsen.uio.no/__skuespill.xhtml
>     <http://www.ibsen.uio.no/skuespill.xhtml>" (Skuespill) appears on
>     the main page.
>
>     Our test environment fetches this document without problems.
>
>     Erlend
>
>

